We will pick up where we left off in Demo 1 to prepare you for analysis. Although this is not going to be as exhaustive as we would like to do if we were doing a complete project, our goal is to see this project go from messy data \(\rightarrow\) prediction & interpretation. After all, we need to give something to the patient advocacy watchdog that they can deploy!
Our goal is to see this analysis through to generate a predictive model that the advocacy group can use, and hopefully it will be a decent one. To ensure that it stands the best chance of being a good predictive model, our analysis plan includes all the key steps in the data science life cycle/project flow, including: cleaning, exploration, feature selection, pre-processing, unsupervised machine learning exploration, and supervised modeling. Whew!
Our specific plan is to perform two big analyses in the hopes of having a nice deliverable to present to our client. The first will be a segmentation analysis, or clustering analysis, which will attempt to group our hospitals together on the basis of their attributes and their pneumonia-related readmissions. Thie will be done using a combination of PCA and \(k\)-means. Our second analysis will be to perform supervised, predictive analysis using an elastic net, which conveniently combines a penalized regression with variable selection so we can point our clients down a path of hospital attributes to focus on, as well as a model they can use to make future predictions.
Your oject is to work through this Project 1. You will have 31 Questions to answer as you work through this, just as with Project 1. Then, you will be able to choose your own adventure again to perform your own clinical informatics analysis:
[Minimal coding, focus is on understanding cleaning, pre-processing, and the ML analyses we performed here.] Choose ONE other condition (anything other than pneumonia) and run through the analyses again, using our same target variable. You will update your question, hypotheses, and predictions, but will be able to re-use all of the other code with very minor modifications. Credit will be based more on interpretation and not on coding. See more information on Adventure 1 here.
[More coding, but not as much, with a focus on the steps
we’ve undertaken here and extending it to a new question &
condition.] Choose a slightly more complicated analysis to
undertake, for example, focusing on surgical interventions
(HIP-KNEE & CABG) or heart-related
conditions (HF, AMI, & CABG).
Coding should still be fairly minimal, but you are likely to run into
problems with the code working exactly as-is, especially during
cleaning. The advantage of this analysis is that it will be much more
robust, have larger sample sizes, and would allow us to be able to say
something far more informative to a subset of patients (e.g., those
considering undergoing surgical interventions). See more information on
Adventure 2 here.
[Most coding, but no cleaning or pre-processing, with an emphasis entirely on the machine learning analyses.] Continue with the pneumonia-related data that you have here, completely cleaned and pre-processed, and do two more unsupervised OR supervised analyses. For example, you could do another kNN using the \(k\) we identify in our Segmentation Analysis here, or you could do a random forest to account for any possible non-linearity. You could even choose to do a Ridge and LASSO regression and compare those analyses to what we get out of an Elastic Net. Any of these are fine, but you must explain your decision in your submission. See more information on Adventure 3 here.
This is the first of your choices in your adventure, and what you choose here could either lead to success or… sudden death (dum, dum, dummmm…). Choose wisely, young data scientists!
Load data from the end of Demo 1, which includes the ordinal and frequency encoding you did (plus a little bonus cleaning to help).
Load my version of the fully cleaned, encoded, & ready-to-proceed dataset from the end of Demo 1. These data are not yet ready for analysis (!) but we would be picking up with that here.
## Uncomment if this is your choice:
# load("FY2024_data_files/pneumoniaAnalyzeFullyEncoded2024.Rdata")
# dat2Analyze <- pneumoniaAnalyzeFullyEncoded
Load data from Demo 1 without ANY encoding, but all other tidying the same.
## Comment out if this is NOT your choice:
load("FY2024_data_files/pneumoniaAnalyzeNoEncoding2024.Rdata")
dat2Analyze <- pneumoniaAnalyzeNoEncoding
Which dataset do you choose and why?
Explain your selection here. I choose the data set Option 2 (to load the data without the encoding) because loading this data will eliminate any changes that resulted from the previous encoding, while ensuring the data is still cleaned and available to be encoded by our new code from this project.
R like an object-oriented
languageMany of you still may be early enough in your programming journeys
that you haven’t yet learned how to fully leverage the
objected-oriented nature of programming languages like
Python. R is not technically an
object-oriented language in the same way, but we can hack it to function
like one. That’s exactly what we are going to do here by loading a
helper function directly from the source
code.
We went through all the trouble in the Demo to perform encoding and
now we get to use it! You will use the source() function to
load a helper function that I have written, using my answer to the
ordinal encoding question of the Demo, so that you can do all your
ordinal encoding in a single line of code. Neat,
huh?
BUT WHY are we doing the encoding now if we just chose to bring in non-encoded data? Because we can do ordinal encoding before or after data partitioning with impunity. It does not introduce data leakage to do so here, so we can do it in either order. This means that we can do it now simply for convenience.
NOTE: If you chose Option 1, the fully-encoded
dataset, you can still run this chunk because there is a conditional in
here that will not run your dataset since it is already encoded! But
still make sure you understand the idea of source code
because we will keep using it throughout the course!
library(dplyr)
## This calls the code in the associated .R file
source(file = "doOrdinalEncoding.R")
## This checks to see if you loaded the already encoded dataset
if(!exists("pneumoniaAnalyzeFullyEncoded")) {
## This sets a list of PATTERNS to match for the columns to ordinal encode
encodeList <- c("ComparedToNational_",
"PaymentCategory",
"ValueOfCareCategory",
"Score_Emergency department volume")
## This runs the function, which takes 3 arguments
## Open the source code to see what each argument does!
dat2Analyze <- doOrdinalEncoding(df = dat2Analyze,
encodeList = encodeList,
quiet = FALSE)
}
NOTE: What follows is my (long-winded) answer to Question 19 from the Demo.
I am going to quickly break down the differences between the possible targets so we can make sure we really understand what they are measuring to allow us to make our most-informed selection of a target. Note: More information was found in this PDF. This is really important for helping us choose the right one!
Notice that I am also including a variable we didn’t look at
previously called
ComparedToNational_Hospital return days for pneumonia patients.
This is because this is our best assessment of how the hospital is
performing with regard to pneumonia readmissions relative to a national
average. However, even though I’m including it in the table, I think
it’s better to think of it as a solid predictor. It’s
telling us how length of stay (LOS) in the hospital
after pneumonia readmission stacks up compared to the national average.
After a hospital readmission, it’s generally better to have
fewer return days rather than more because it suggests better
quality of care, more timely follow-up, and a greater chance of the
patient recovering well at home. Since these are readmissions, having
more days than average could also suggest that the hospital
missed the severity or didn’t properly plan for the patient’s recovery
at home and released them too early.
You may be wondering - could we use it as a target? Sure - but it’s
not for the question we were posed by our stakeholder! Note that, if we
did, we would need to perform a multinomial classification analysis. So,
instead, let’s use this category to help us contextualize our hospitals
as we choose between Predicted, Observed, and
Expected readmission rates, as well as
Excess Readmission Ratio.
| Possible Target | Description |
|---|---|
| ComparedToNational_Hospital return days for pneumonia patients | Hospital return days measures the number of days patients spent back in the hospital (in the emergency department, under observation, or in an inpatient unit) within 30 days after they were first treated and released for pneumonia. Reported as compared to the national average, such that ‘below average’ is better and ‘above average’ is worse. |
| ExpectedReadmissionRate | The expected number of readmissions in each hospital is estimated using its patient mix and an average hospital-specific intercept. It is thus indirectly standardized to other hospitals with similar case and patient mixes. |
| PredictedReadmissionRate | The number of readmissions within 30 days predicted based on the hospital’s performance with its observed case mix. The predicted number of readmissions is estimated using a hospital-specific intercept, and is intended to reflect the annual expected performance of the hospital given its historical case and patient mix and performance. |
| ExcessReadmissionRatio | The ratio of the predicted readmission rate to the expected readmission rate, based on an average hospital with similar patients. Performance is compared against a ratio of one, such that below one is better and above one is worse in terms of readmission rates. |
| observed_readmission_rate | Our calculation of the observed number of pneumonia-related readmissions within 30 days found by dividing the number of readmissions observed for the hospital during the given period by the number of discharges for that same period, and multiplied by 100 to put it on the same scale as the Predicted and Expected Readmission Rates. It reflects a true observation as reported by the hospital during that period, but is not adjusted for case mixes or prior information for that hospital. Thus, this is a crude, unadjusted value. |
Now, I know I didn’t ask you to do it, but I will actually begin my assessment of which target is most appropriate by constructing a quick summary table. In my assessment of what is an appropriate target,
| Characteristic | N = 4,8161 |
|---|---|
| National Comparison Return Hospital Days | |
| Better than average | 402 / 4,816 (8.3%) |
| Same as average | 2,285 / 4,816 (47%) |
| Worse than average | 827 / 4,816 (17%) |
| Unknown | 1,302 / 4,816 (27%) |
| Predicted Readmission Rate | 16.46 (1.93), 16.27 |
| (Missing) | 2,090 |
| Observed Readmission Rate | 17.1 (3.7), 16.9 |
| (Missing) | 2,603 |
| ExcessReadmissionRatio | 1.00 (0.06), 1.00 |
| (Missing) | 2,090 |
| Expected Readmission Rate | 16.43 (1.51), 16.39 |
| (Missing) | 2,090 |
| 1 n / N (%); Mean (SD), Median | |
We can see right away that we lose nearly 600 more data
points if we opt to use our calculated
observed_readmission_ratio (NOTE: nearly 500 if using the
2025 FY data), which is going to be a by-product of the fact that the
Predicted and Expected readmission ratios are
using prior information and information about patient mixes to generate
these values. Our observed_readmission_ratio is crude and
unadjusted any way - which could present problems when it comes to true
comparability. So, off the bat it feels as if the constructed
target is possibly not a good choice despite its simplicity for
stakeholders to understand.
ComparedToNational_Hospital return days for pneumonia patientsAlthough
ComparedToNational_Hospital return days for pneumonia patients
is very compelling, it’s better as a predictor rather
than a target because it gets at an aspect of hospital performance: once
readmitted due to pneumonia, how long (as compared to the national
average) are patients spending in the hospital? This means it makes a
really nice predictor to use to compare our four possible choices
for pneumonia-related hospital readmissions. We are looking for a
clear separation in our ideal target between hospitals that have lower
readmission rates AND also tend to do better than the national average
at healing their readmitted patients!
So, this makes a really nice variable I could choose to
facet plots, which means creating sub-plots that show
up as panels. The thing is, if I know I want to facet
plots, the first thing I have to do is reshape the data I want from
wide into long format using, for example, the
pivot_longer() function (the complement of
pivot_wider() that we used in the Demo).
I don’t have to store it as a separate data frame (i.e., I can just
pipe it directly into ggplot()!) but here I am choosing to
so we can more easily make sure it’s reshaped correctly along the way.
Thus, I am making a long-format version of the possible target variables
called longTargets so that I can better decide which target
to choose.
I mentioned violin and density plots
specifically in Question 19 of Demo 1, as well as faceting. I am going
to show you two ways to explore these data with these types of plots and
you can decide which one(s) you find most informative here. You also may
have thought of a better way to display the data too!
geom_violin():Here I am leaving the NA data (which I relabelled as
Unknown) to get a sense for the degree of missingness.
Figure 1. Readmission Rates vs. readmitted hospital LOS
If you have never encountered violin plots before, they are honestly a wonderful way to assess the symmetry and skew of a distribution while more easily enabling comparisons between groups. They’re effectively a hybrid between a histogram and a boxplot, and we can even opt to overlay the boxplot quartiles in the plots as well, which is particularly handy!
Your task is to do just that by adding
geom_boxplot(width = 0.1) (yes, just that one line!) in the
appropriate place. Also change include = FALSE to
include = TRUE in the R chunk header so that it shows up
when you knit!
Make sure to comment your line to call attention to it:
Figure 2. Readmission Rates vs. readmitted hospital LOS
## NULL
Now that you can compare these groups more easily, what we’re looking for is a balance between clear difference between hospital ordinality relative to the national average (i.e., “Are better hospitals really better in pneumonia readmission rates?”).
Which target(s) appear the most distinct to you? > Your answer here. The target ‘Better’ appears to be the most distinct, due to its lack of outliers and a more consistent (compact) shape of the violen plot centered around the mean from the boxplot. The ‘Unknown’ target is also distinct, in thata its pattern is noticably different than the three other targets (which makes sense given our lack of data).
geom_density():We could also choose to overlay density plots instead of violin or boxplots, as this can sometimes help us see subtler separation of groups than we can see otherwise. This time, I am also going to ignore the “Unknown” (i.e., missing) class of hospitals to prevent things from getting too messy!
Does this view confirm or change your decision from Question 3?
Your answer here. It changes my decision, as the Excess Readmission Ratio is the one graph that has a clear delineation between the ‘Better’, ‘Same’, and ‘Worse’ variables.
Going forward, I have selected PredictedReadmissionRate
as our target. In a more thorough analysis, we would likely want to
consider:
Circling back to our stakeholder to try to refine our question or target using domain knowledge
Perform experiments with our models to find the best predictive model since that is our ultimate goal.
However, given that we do not have the option or time for that, let
me explain my choice. It’s the best balance of
distinguishing hospitals with better and worse readmitted hospital
stays, while still providing an adjusted,
hospital-specific measure. Adjusted values will generally
always be a better choice than the crude, even if our crude measure
captures something interesting about the data, as it did here, or if
it’s easier for our stakeholder to comprehend. Our crude measure,
observed_readmission_rate did the best job
of recapitulating the readmitted hospital LOS
(ComparedToNational_Hospital return days for pneumonia patients),
but if that’s really our focus, wouldn’t it then be better to switch
from a readmission rate to that as our target? Probably.
If you still feel unsure about what is the best choice, you’re not
alone - so do I! This is where, because we are working with an
imagined stakeholder, we must make the best decision we
can given the question we defined in the Demo. I would defend my choice
of Predicted Readmission Rate for three key reasons:
It has been adjusted to be hospital-specific, using a hospital-specific intercept and historic data about the hospitals’ performances
We don’t lose nearly 600 data points to missingness.
Our imagined stakeholder, and the question we set, was about
readmission rate so even though readmitted
return hospital days might also be a similarly viable
candidate, it doesn’t reflect our imagined stakeholder’s request. Note
that the ExpectedReadmissionRate is standardized for
similar hospitals, and may not truly reflect a given hospital; the
ExcessReadmissionRatio, while informative, was the least
likely to reflect the distinctiveness in hospital return days, which is
still of interest to us.
But remember that what matters most is that we are matching our question.
Do you agree with my choice of target? Why or why not? (You are free to disagree!)
I understand the choice of target when it is looked in comparison to the EDA I conducted in the previous question but I do think that the ‘Excess Readmission Ratio’ does a better job of representing all of the date and limiting influence from confounding variables by taking into account multiple variables.
All studies have bias (introduced or systemic error) and limitations (failure to fully explain something). We could go more deeply into these concepts, but for now I want you to take a stab and just brainstorming either ways we might have bias OR the limitations of using this target variable. You do not need to answer both, but you’re welcome to do so. You can read more about bias here or see some deeper explanations of limitations here. You do not need to write a lot; just 2-3 sentences. I am simply asking you to pause and reflect on our choices here before we proceed.
Your answer here. The biggest limitation I can brainstorm for ‘Predicted Readmission Rate’ is the fact that is is not an actual observed value but a projected one. It is also limited in that it cannot act as a stand-alone variable, in that its performance as a variable is more highly linked to its comparison to other variables (i.e. variables that would affect readmission, such as cleanliness and staff responsiveness).
The time has finally come to officially move back into pre-processing and feature selection so that we can finally build some models. I want you to take note of the order in which I’m doing things here; the flow can change from project to project, but generally it’s a good idea to get your dataset to a state of manually curated features before making your data partition.
It’s time to say goodbye to the features we know would be far too
redundant or non-informative given our target of
PredictedReadmissionRate. We don’t need any of the
Sample_ columns, for example; these are merely sample
sizes. Useful if we were doing standardization, but not necessary for us
right now. We also are dropping the other possible target features that
are too redundant with PredictedReadmissionRate.
Add your comments to the code as indicated to explain what I was doing at each of the steps.
Your answer in the comments.
dat2Analyze <- dat2Analyze %>%
## COMMENT HERE
# This code removes unnecessary variables to simplify the data set.
select(-ExpectedReadmissionRate,
-ExcessReadmissionRatio,
-observed_readmission_rate,
-contains(c("Sample_", "NumberOfPatients")),
-NumberSurveysCompleted) %>%
## COMMENT HERE
# This code renames certain variables, giving them shorter, easier to understand names.
rename(`Median time (minutes) patients spent in ED` =
`Score_Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better`,
`Composite patient safety` = `Score_CMS Medicare PSI 90: Patient safety and adverse events composite`,
`ComparedToNational_Composite patient safety` = `ComparedToNational_CMS Medicare PSI 90: Patient safety and adverse events composite`) %>%
## COMMENT HERE
# This code selects the variable 'PredictedReadmissionRate' and movesit in front of everything so that is is at the beginning of the data set.
select(PredictedReadmissionRate, everything())
## COMMENT HERE
# This code is used to clean the column names and replace characters in the columns.
colnames(dat2Analyze) <- gsub('Score_|HcahpsLinearMeanValue_|_Payment', '', colnames(dat2Analyze))
Next, we are going to take advantage of the
nearZeroVar() function that is part of the
caret package. We will drop any of the columns with
near-zero variance. This is a way of automating (and thus standardizing)
the process for choosing to drop non-informative columns due to zero or
nearly zero variance.
If you are not familiar yet with the concept of zero or near-zero variance, do a little digging. Based on what you read, try to explain for your naive advocacy group stakeholders, why it’s important to remove the features with little variance.
Your answer here. Near zero variance is the statistical measure that identifies variables in a dataset as having few or only one output, creating an extremely low variance for the variable. In terms of looking at the data, these features do not show any relationships and are not very useful in answering larger questions.
| Variables Dropped |
|---|
| ComparedToNational_Postoperative respiratory failure rate |
| ComparedToNational_Perioperative pulmonary embolism or deep vein thrombosis rate |
Take note of the number of features here.
You may or may not have noticed that I moved FacilityID
to our rownames in the datasets you loaded today. Why did I choose to do
that? Why do I want to make sure unique identifiers, like
FacilityID, are out of the dataframe before I either (a)
run nearZeroVar() or (b) proceed with the rest of the
analysis?
Your answer here. ‘FacilityID’ is not a variable with a relationship to be seen but just a variable used to contain identifiers for the observations in the data set. Therefore, it will not be informative as a predictor in analyszing the variables relationships. Additionally, because every ID is different, this column has the most variance present in the data set so ‘nearZeroVar()’ does not work.
Do some research on, or lean into your previous knowledge, about the curse of dimensionality or the big-P, little-N (n >> P) problem. Using whatever you find or remember, can you justify why we likely do NOT have a big-P, little-N (n >> P) problem here? Make sure to explain.
Your answer here. The curse of dimensionality is a fun way of naming the problems that arise when you are working with an extremely large data set and a lot of information. One such problem within this curse umbrellanis the ‘big-P, little-N (n >> P) problem’ , which occurs when there is a significanyly larger amount of predictor variables when compared to the observations. We likely don’t have this problem because hospitals collect such large amounts of data that we have large amounts of information (and observations). We lso pre-processed the data, so there are already less predictors than we started with.
We are going to deal with imputation in a rather quick and elegant
way that requires minimal effort from us using a model-based approach
with missForest. You may recall from lecture that I said we
are choosing a model-based approached without uncertainty
because our ultimate goal is to create a deployable model for our
stakeholder.
Write some code that will quickly tally the number of columns in the
dataset that contain missing values. You will find
colSums() and is.na() functions likely useful
here. What percent of the features have at least some level of
missingness?
# Your code here.
#Counting how many columns have at least one NA
num_missing_cols <- sum(colSums(is.na(dat2Analyze)) > 0)
#Total number of columns in the dataset
total_cols <- ncol(dat2Analyze)
#Calculating the percentage of columns with missing values
percent_missing_cols <- (num_missing_cols / total_cols) * 100
num_missing_cols
## [1] 31
total_cols
## [1] 31
percent_missing_cols
## [1] 100
Your answer here. All of the columns (100%) havesome level of missingness.
Which column(s) are not missing any data?
Your answer here. None.
How many complete observations do we have in the
dataset? You may find nrow() with
complete.cases() or drop_na() useful!
#Getting rid of rows with any NA and counting what's left
num_complete_obs <- dat2Analyze %>% drop_na() %>% nrow()
num_complete_obs
## [1] 644
There are 644 complete observations in the data set.
md.pairs() function from the
mice package to assess the extent of missing values before
imputing.When we are missing a lot of data, as we are here, it can sometimes be useful to predetermine whether the patterns of missing correlate with our target. Why? We care if missing values correlate with the target because this may indicate that the missingness itself carries information. Thus, ignoring it could lead to biased models or low predictive power!
In fact, missingness can itself become a predictor.
If a missing value is associated with the outcome, it may carry
predictive signal. For example, in our data, hospitals with high rates
of missingness in Left before being seen highly associated
with pneumonia readmission might indicate a worrying pattern. Perhaps
hospitals are negligently failing to report how many of their patients
leave without being seen because of larger, systemic issues that ALSO
happen to impact overall quality and patient care. Thus, missingness
could itself become a predictor!
To do, I will leverage the md.pairs() function from the
mice package which calculates the pairwise correlations of
missingness between variables and allows us to visualize those patterns
using a clustered heatmap. When I do this, I will drop
State from the matrix because it will create errors if I
retain it. (Why?)
## Drop State and analyze the pairwise missingness using the mice package
miceMatrix <- dat2Analyze %>% select(-State) %>% md.pairs()
What md.mice() does is create two pairwise correlation
matrices, mr and mm, where:
mr = Missing vs. Responded
mm = Missing vs. Missing
Let’s take a look at the first 6 rows and 4 columns of the
mr matrix:
| PredictedReadmissionRate | MRSA Bacteremia | Payment for pneumonia patients | Abdomen CT Use of Contrast Material | |
|---|---|---|---|---|
| PredictedReadmissionRate | 0 | 48 | 630 | 1238 |
| MRSA Bacteremia | 1041 | 0 | 1650 | 2236 |
| Payment for pneumonia patients | 21 | 48 | 0 | 661 |
| Abdomen CT Use of Contrast Material | 37 | 42 | 69 | 0 |
| Death rate for pneumonia patients | 10 | 47 | 6 | 609 |
| Postoperative respiratory failure rate | 425 | 59 | 1042 | 1574 |
Any given cell in a matrix is always of the form \([i, j]\), where \(i\) is the row and \(j\) is the column - just as you are used to
with dataframes! So, in these correlation matrices of our pairwise
variables, cell \([1,1]\) is
PredictedReadmissionRate vs. itself, so the count is 0 - as
is every value on the diagonal, like with all correlation matrices. For
cell \([1,2]\)
PredictedReadmissionRate vs. , there are 42 observations
where variable PredictedReadmissionRate (\(i\)) is missing but
MRSA Bacteremia (\(j\))
was observed. Thus, we can extract the number of cases where
variable \(i\) is missing and variable
\(j\) is observed using
miceMatrixmr[i, j].
Now, let’s take a look at the first 6 rows and 4 columns of the
mm matrix now:
| PredictedReadmissionRate | MRSA Bacteremia | Payment for pneumonia patients | Abdomen CT Use of Contrast Material | |
|---|---|---|---|---|
| PredictedReadmissionRate | 2090 | 2042 | 1460 | 852 |
| MRSA Bacteremia | 2042 | 3083 | 1433 | 847 |
| Payment for pneumonia patients | 1460 | 1433 | 1481 | 820 |
| Abdomen CT Use of Contrast Material | 852 | 847 | 820 | 889 |
| Death rate for pneumonia patients | 1292 | 1255 | 1296 | 693 |
| Postoperative respiratory failure rate | 1788 | 2154 | 1171 | 639 |
Similarly, miceMatrix$mm[i, j] gives us the number of
cases where both variables \(i\) and
\(j\) were co-missing.
Explain what you think the number in cell \([6,1]\) means. Out of all the observations in the dataset, does this feel like a high degree of co-missingness?
Your answer here. Firstly, the numer in cells [6,1] joins PredictedReadmissionRate vs. Postoperative respiratory failure rate. I believe this number means that in 1788 observations in the data set, PredictedReadmissionRate was missing but Postoperative respiratory failure rate was present. This does seem to be a high degree of co-missingness because it is one of the higher co-missingness numbers with PredictedReadmissionRate.
Lastly, we can calculate a proportion of missingness vs. response between any two variables \(i\) and \(j\) such that:
\(Proportion_{i,j} = \frac{mr_{i,j}}{mr_{i,j} + mm_{i,j}}\)
In non-mathy speak, it’s the fraction of the times that variable
\(j\) is observed among rows where
\(i\) is missing. So, if you want to
impute variable \(i\) - let’s say it’s
MRSA Bacteremia - then you will be interested in the
variables typically observed (present in the dataset)
when \(i\) is missing. Those are the
variables that will be able to help us predict, or fill in, \(i\).
Figure 5. Correlation between Missing and Response
It may feel a little counterintuitive, but here darker colors indicate a higher correlation between missingness and response (proportions nearer to 1). In other words, these are the variables that are going to be more informative during imputation for a given variable.
But what if we wanted to know about patterns of missingness with
regard to our target, PredictedReadmissionRate? Well, we
could get that too!
Figure 6. Correlation between Missing and Missing for Target
What columns seem to be most correlated in terms of missingness to
our target, PredictedReadmissionRate? Does this give you
any cause for concern? Feel free to speculate as to why you might see
this.
Your answer here. The variable that seems to be most correlated to missingness with ‘PredictedReadmissionRate’ is ‘MRSA Bacteria’. This does give me cause for concern, as a bacterial infection would likely see readmission. If this data is missing, it’s hard to devise a plan that treats and controls form the spread of MRSA bacteria.
We will not be able to use those rows at all moving forward, sadly. That is going to reduce our sample size pretty sizably, from \(N = 4,816\) to only \(N = 2,726\) (or from \(N = 4,775\) to only \(N = 2,731\) .
## [1] "Rows: 2726" "Columns: 31"
Thinking back to your answer to Question 12, do you feel any differently given these dimensions of the dataset?
When the overall size of the dataset diminishes like that, the number of co-missing and missing-responded data points makes up a more substantial portion of the data. However, since we are removing all data with a missing PredictedReadmissionRate, the co-missing values for concerning data like MRSA Bacteremia will also be removed. Though the number of missing-responded data points with MRSA and PredictedReadmissionRate is still very high, similar to the rate of comissing values prior to removal.
At this stage, we are almost ready to impute - but not until we have done our data partitioning first! In fact, splitting our data into training and testing partitions first is really important to prevent data leakage.
Explain to your stakeholder, in simpler language, (1) why we partition data into testing and training sets and (2) what data leakage is. You do not need to go into great detail, but it should explain enough that they understand the basic ideas.
The training data will be used for exactly that, training the model. Think of the test set as data we don’t have yet. It’s essentially simulating future data collected or future scenarios you would implement this model on. You wouldn’t need a model if you could predict the future. Similarly, a model that is told the answer by being given the ‘future’ (test) data will be worse for your use case. It’s like if you’re cheating on a test: you don’t learn as much when you’re given the right answers from the start, so you’ll be worse off when you eventually get to something you haven’t been given the answers for.
You may recall that I mentioned that the optimal ratio is has been argued to be \(\sqrt{p} : 1\), where \(p\) is the number of parameters (which may or may not equal your predictors depending on your planned analysis). Note that you do not have to split only a single time; you could make different splits for different analyses, if needed. However, we are going to use a single split here for convenience.
Now, wouldn’t it be nice to have a handy function that could calculate the optimal split ratio for us? So, let’s write one! This is something you can use with ANY analysis going forward!
There are three parts to this question to get it to all come together correctly.
Comment the function to explain what it does (I’ve defined the arguments for you)
Run the function (fill in the blank)
Fill in the missing piece in the data partitioning chunk below
calcSplitRatio <- function(p = NA, df) {
## @p = the number of parameters. by default, if none are provided, the number of columns (predictors) in the dataset are used
## @df = the dataframe that will be used for the analysis
## This if statement is set up to check if the number of parameters is not provided.
if(is.na(p)) {
p <- ncol(df) -1 ## The if statement triggers this line if the number of parameters is not provided, defaulting to the number of columns in the dataframe minus 1 (the target variable). This does not account for a rownames column if that is in the dataset, so inputting p manually is best practice.
}
## This calculates the number of samples contained within the test set by multiplying the proportion of 1: sqrt(p) by the total sample size, represented by the number of rows in the data set.
test_N <- (1/sqrt(p))*nrow(dat2Analyze)
## This line calculates the proportion of the sample size that should be ascribed to the test set, rather than the raw number.
test_prop <- round((1/sqrt(p))*nrow(dat2Analyze)/nrow(dat2Analyze), 2)
## This subtracts the proportion ascribed to the test set from 1 to get the proportion ascribed to the training set.
train_prop <- 1-test_prop
## This prints out the proportion values neatly for interpretation sake.
print(paste0("The ideal split ratio is ", train_prop, ":", test_prop, " training:testing"))
## This outputs the training proportion as a value that can be stored in an object/etc.
return(train_prop)
}
## Fill in the blanks to run:
train_prop <- calcSplitRatio(df = dat2Analyze)
## [1] "The ideal split ratio is 0.82:0.18 training:testing"
Hint: One important note here is that if the number of parameters are not provided (which is true for the starter code I gave you to run it - it does NOT pass any parameters to that argument!), then by default our function is designed to take the number of columns of the dataframe and subtract one for the target variable. Neat, huh?
Now, uncomment this code and fill it out to run it.
ind <- createDataPartition(dat2Analyze$PredictedReadmissionRate, ## Put the target here!
p = train_prop, ## Use the object you just made!
list = FALSE)
train <- dat2Analyze[ind, ]
test <- dat2Analyze[-c(ind), ]
What split ratio did you get? Is it close to the canonical 80-20 split? Why do you think that happened?
I got a 0.82:0.18 training:testing split. This is very close to the 80-20 split. The 80-20 split is ‘canonical’ for a reason I presume, it is likely close to a lot of the ideal splits for datasets of a ‘typical size’ which I would argue this set is. I’d expect more deviation from that split as the sample size approaches the extremes.
Did you get stuck somewhere? If so, load the
test and train data in case you struggled with
Question 18 so you can proceed:
load(file = "FY2024_data_files/pneumoniaTrain.Rdata")
load(file = "FY2024_data_files/pneumoniaTest.Rdata")
missForest.We are going to take advantage of the missForest()
function from the missForest package. In a nutshell, what
missForest does is it will fit either a regression- or
classification-based random forest model and use the OOB (“out-of-bag”)
results to predict and fill in NAs. The advantage is that
this process is non-linear, done in just a few lines of code, and can
handle mixed data-types (although you need to make sure that all your
numerical types are of the same numerical type, and that all characters
or factors are of the same type).
missForestThe way I have chosen to do this is to separate the data so that I
can turn the encoded categories back into categories temporarily, then
stitch them back together into a temporary dataframe so that I can run
missForest. Notice that I temporarily drop the
State and target features; I drop State to
spare ourselves a little computational time and I drop
PredictedReadmissionRate to prevent data
leakage. Be patient!!. Since this is
randomForest, it could take a minute or two to run.
NOTE: If your machine fails to run this, there are
data you can load. Some computers may struggle to run this.
# Remove encoded categorical variables for imputation, plus the target and state (which have no missing)
#data_sans_encoded = train %>%
# select(-PredictedReadmissionRate,
# -State,
# -`Emergency department volume`,
# -contains("ComparedToNational"))
#data_cat = train %>%
# select(`Emergency department volume`,
# contains("ComparedToNational")) %>%
# mutate_all(as.factor)
# Stitch them back together:
#temp <- cbind(data_sans_encoded, data_cat)
# Impute missing values using missForest
#imputedTrain <- missForest(temp,
# variablewise = TRUE,
# verbose = FALSE)
NOTE: Did you catch that I not only removed
the target before imputation, but I dropped State too? This
is because we know State is/will be frequency encoded
(depending on our path), and it could create imputation issues as a
result. This leaves me with 29 columns to perform imputation
on.
load(file = "FY2024_data_files/imputedTrain.Rdata")
NOTE: If yours didn’t run, make sure to look at the
table output in the knitted HTML I provided you. Also, change the
include=FALSE in the chunk header if you were not able to
get missForest to run.
What does MSE (Mean Squared Error) and PCF
(Proportion of Falsely Classified) indicate here? You may find looking
at the missForest Vignette
helpful. Which variable(s) had the highest OOB error?
MSE represents the mean squared error between imputed and true values. MSE is an indicator of accuracy in imputation for numeric variables. PFC represents the proportion of falsely classified categorical variables. “Payment For Pneumonia Patients” had an astronomically high MSE compared to other variables. “Healthcare workers given influenza vaccination” and “Venous Thromboembolism Prophylaxis” had relatively high MSE as well, though three orders of magnitude lower than “Payment For Pneumonia Patients.” For categorical variables, “Emergency department volume” and “ComparedToNational_Hospital return days for pneumonia patients” had higher PFC values, with 0.47 and 0.36, respectively. This represents a substantial amount of false classifications.
NOTE 1: If the OOB error (either MSE or PCF) from
missForest is really high, it’s a red flag that the
imputation for those variables is untrustworthy. This could be do to too
much missingness, weak relationships among the variables making it hard
to predict, or low quality data (e.g., noisy, sparse). In such cases, we
want to DROP any variables with extremely high OOB
error because they add noise. A general rule of thumb is that if the PCF
> ~0.3 (30% of the imputed values are wrong) or if there is
especially high MSE (inflation) relative to the variance of the
variable, we should drop it. Let’s set those aside so we can drop them
from test and train both downstream!
NOTE 2: Even though the PCF is a bit too high, I am
choosing to retain
ComparedToNational_Hospital return days for pneumonia patients
because it is a primary predictor. This could be a poor choice, however!
Every decision we make as data scientists can come with
pitfalls.
## These look okay. Uncomment to see their variances.
#var(train$`Hospital return days for pneumonia patients`, na.rm = T)
#var(train$`Median time (minutes) patients spent in ED`, na.rm = T)
#var(train$`Healthcare workers given influenza vaccination`, na.rm = T)
cols2drop <- c("Payment for pneumonia patients", "Emergency department volume")
Alternatively, we could use MICE (Multivariate Imputation with Chained Equations), which is a method that will model uncertainty along with the variables. Do a little research. How does MICE work? Include a citation to a source.
Essentially, MICE is an iterative method of imputation, beginning with a simple assumption (i.e., the mean, median, etc.) imputation for missing values. Then, MICE uses the other variables in the data to model and predict a replacement for this simple assumption for the missing values in one variable, then using those new predicted values in conjunction with other variables to model and predict missing values in the other variables. This iterates through the data set, improving the imputed values in each variable from the baseline simple imputation. Reference: The MICE Algorithm, Sam Wilson (2021); https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html
We are almost done with imputation. The last thing we need to do is
extract the imputed values from imputed_data$ximp, and add
the State and target variable back. Then we get to wash,
rinse, and repeat for the test set! Oh my!
State and target
variables backimputedTrain <- imputedTrain$ximp %>%
## Put the original State and target variables back on
mutate(State = train$State,
PredictedReadmissionRate = train$PredictedReadmissionRate)
test# Remove encoded categorical variables for imputation, plus the target and state (which have no missing)
data_sans_encoded = test %>%
select(-PredictedReadmissionRate,
-State,
-contains("ComparedToNational"))
data_cat = test %>%
select(contains("ComparedToNational")) %>%
mutate_all(as.factor)
# Stitch them back together:
temp <- cbind(data_sans_encoded, data_cat)
# Impute missing values using missForest
imputedTest <- missForest(temp,
variablewise = TRUE,
verbose = FALSE)
#load(file = "FY2024_data_files/imputedTest.Rdata")
State and target
variables back on testimputedTest <- imputedTest$ximp %>%
## Put the original State and target variables back on
mutate(State = test$State,
PredictedReadmissionRate = test$PredictedReadmissionRate)
train and
testimputedTrain <- imputedTrain[, !(names(imputedTrain) %in% cols2drop), drop = FALSE]
imputedTest <- imputedTest[, !(names(imputedTest) %in% cols2drop), drop = FALSE]
Quickly explore how well the imputation did by choosing at least one
numeric variable and one of the categorical variables. Did the
distributions or frequencies change drastically? You may find the
summary() and table() functions to be fastest
here.
print(summary(test$`Perioperative pulmonary embolism or deep vein thrombosis rate`))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.870 3.140 3.470 3.563 3.900 6.550 27
print(summary(imputedTest$`Perioperative pulmonary embolism or deep vein thrombosis rate`))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.870 3.160 3.500 3.569 3.870 6.550
print(table(test$`ComparedToNational_MRSA Bacteremia`))
##
## -1 0 1
## 7 281 23
print(table(imputedTest$`ComparedToNational_MRSA Bacteremia`))
##
## -1 0 1
## 7 486 23
The imputation seems to have done well for ‘Perioperative pulmonary embolism or deep vein thrombosis rate’ as the summary statistics have shifted only slightly and 27 NAs have been removed. For ‘ComparedToNational_MRSA Bacteremia,’ however, the distribution has skewed substantially towards the middle, with over 200 0s being added and no -1s or 1s being added.
It’s time, yet again, to make a choice. If you opted to load the fully encoded dataset and everything has been done on that, you will run the first chunk. But if you decided that you wanted to wait to encode the TIME HAS FINALLY COME!
missForestUltimately, we need to turn the factor features back into numbers for
our next analyses but we had to turn them into factors for
missForest. So, now we need to turn them back. Notice a bit
of annoying trickery here. Because we had converted them to a factor
type, in order to back-convert them correctly to our original ordinal
encoding, we had to first convert to character and then to numeric. So,
our conversion goes factor \(\rightarrow\) character
\(\rightarrow\)
numeric in order to get exactly what we need! Again, we
must remember to perform on BOTH imputedTrain and
imputedTest.
## Uncomment if this is your choice!
# readyTrain <- imputedTrain %>%
# mutate_if(is.factor, as.character) %>%
# mutate_if(is.character, as.numeric)
#
# readyTest <- imputedTest %>%
# mutate_if(is.factor, as.character) %>%
# mutate_if(is.character, as.numeric)
The time has FINALLY come to do frequency encoding, if that is the
fork in the road you took all the way back in Section 1.3. Once again, we will be using a
helper function that we load with the source() function
like we did here. It will do the frequency
encoding for us on the State variable, if it needs to be
done. If you started from the fully encoded dataset, there is a
conditional statement that will skip this for you.
## This calls the helper function in the associated .R file
source(file = "doFrequencyEncoding.R")
## If State is not already a number
if(!is.numeric(imputedTrain$State)) {
## This sets a list of COLUMN NAMES I want to frequency encode
cols2encode <- c("State")
## This runs the function, which takes up to 4 arguments
## Open the source code to see what each argument does!
## It returns a LIST of frequency encoded dataframes.
freqEncoded <- doFrequencyEncoding(train = imputedTrain,
test = imputedTest,
cols2encode = cols2encode,
quiet = FALSE)
## Lastly, extract the newly encoded, imputed, dataframes!
readyTrain <- freqEncoded$train
readyTest <- freqEncoded$test
}
Because we are now frequency encoding AFTER data partitioning (we did
ordinal encoding BEFORE the split for convenience), we absolutely
must apply the same encoding map to the
test set that we did to the train set. Explain
why, and what should we do if some classes of the variable end up in the
train that is not in the test or vice versa?
(HINT: Look at the source code!)
We must use the training encoding map instead of creating an encoding map for the testing data to avoid any data leakage. We should not expose the model to the test set at any point, even in the encoding stage which seems at face value to not impact predictions in the end. Any exposure to the test set = data lieakage. In the case that there is a class contained only within the training set, there is no issue as the encoding map will account for it and just not apply it to the test set at all. For the inverse, the ‘new’ value in the test set that was not seen by the encoding map will register as an NA. The source code provided replaces these NAs with 0s to ensure usability and compatibility moving forward.
Ultimately, we know we will need to scale & center our data, as that is required for the analyses I have set out in our Analysis Plan. But before we get into that, it would be nice to know what realm of analyses in which we fall and, more importantly, do we need to consider transformation of our target in addition to the scaling & centering we plan to do on the entire dataset.
In other words: is our target variable approximately normally distributed?
Figure 7. Histogram of Raw Predicted Readmission Rate
Why is a histogram, or even a Q-Q plot, an insufficient way to assess whether a distribution is approximately normal?
They are decent graphical representations of a distribution. That being said, they are graphical, meaning they need to be looked at by a human to make the final call on whether or not the distribution is normal. I may think the above data looks normally distributed while a more trained eye might think otherwise, or vice versa. They are good for an initial check, but are not statistical proof by any means.
Now Apply a Shapiro-Wilk
test for normality using the shapiro.test() function.
What does it indicate about your distribution? (Note:
Shapiro tests are only reliable for \(N <
5000\), but it is fine to perform here.)
shapiro.test(readyTrain$PredictedReadmissionRate)
##
## Shapiro-Wilk normality test
##
## data: readyTrain$PredictedReadmissionRate
## W = 0.98544, p-value = 3.085e-14
shapiro.test(readyTest$PredictedReadmissionRate)
##
## Shapiro-Wilk normality test
##
## data: readyTest$PredictedReadmissionRate
## W = 0.99241, p-value = 0.01002
The above Shapiro-Wilk normality tests produce a p-value of <0.05 for the PredictedReadmissionRate for both the training and testing sets. This means that rejects the null-hypothesis of the Shapiro-Wilk test, which states that the data is normally distributed. This result suggests that the PredictedReadmissionRate is not normally distributed in either the training or test set.
The Box-Cox transformation is a family of power transformations used to stabilize variance and make data more normally distributed (Box & Cox, 1964). Recall from our brief discussion in lecture that the Box-Cox transformation involves the parameter \(\lambda\), which when applied to \(y\) yields the transformation. It is defined as
\(y^\lambda = \frac{y^\lambda - 1}{\lambda}\) for \(\lambda \ne 0\).
When \(\lambda = 0\), it is the same as the \(\ln(y)\). Our lecture slides have more reference values for how to interpret \(\lambda\). But if \(\lambda \ne 0\), then this means that the transformation is only applicable to positive values. The end result is that it helps improve the performance of statistical models that assume normality.
caret to
make our lives easier, BoxCoxTrans().bc_data <- BoxCoxTrans(readyTrain$PredictedReadmissionRate)
print(paste0("Box-Cox lambda = ", bc_data$lambda))
## [1] "Box-Cox lambda = -0.3"
We can see that the estimated \(\lambda = -0.3\), so approximately a square-root transform (see lecture slides for reference). I found that \(\lambda = -0.5\) in FY 2025, for those using that year’s data.
train data:readyTrain$bc_PredictedReadmissionRate <- predict(bc_data, readyTrain$PredictedReadmissionRate)
Can you see what it did to the distribution? look closely, if not. It’s subtle but discernible!!
Now repeat all of the steps to perform a Box-Cox transform on your
own to the imputedTest data. Why must we (1) also transform
the test data and (2) use the same \(\lambda\) as the training
data?
We must transform the test data as well to make sure that the two sets are on the same scale. Otherwise they are inherently no longer comparable and our model will have significant issues when predicting the test set. We cannot derive a new lambda from the test set as that would introduce another avenue for data leakage for the model.
bc_test <- BoxCoxTrans(readyTest$PredictedReadmissionRate)
print(paste0("Box-Cox lambda = ", bc_data$lambda))
## [1] "Box-Cox lambda = -0.3"
readyTest$bc_PredictedReadmissionRate <- predict(bc_data, readyTest$PredictedReadmissionRate)
ggplot(readyTest, aes(x = bc_PredictedReadmissionRate)) +
geom_histogram(alpha = 0.6,
fill = "hotpink",
color = "hotpink") +
theme_minimal() +
labs(title = "Box-Cox Transformed Predicted Readmission Ratio, N = 2,726 hospitals",
y = "Frequency",
x = "Predicted Readmission Ratio")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qqnorm(readyTest$bc_PredictedReadmissionRate)
qqline(readyTest$bc_PredictedReadmissionRate,
col = "hotpink", lty = 2, lwd = 2)
# Make sure to check that your transform looks successful!
NOTE At this point, you have both an non-transformed
(PredictedReadmissionRate) and Box-Cox transformed
(bc_PredictedReadmissionRate) version of the target in your
data.
Now that we have applied a Box-Cox transform to the target variable, we can center and scale the remainder of the variables. We had to perform the Box-Cox prior to centering because the data has to be positive for a Box-Cox transformation (unless we switch to a Yeo-Johnson transform (Yeo & Johnson, 2000)). Centering will likely make a lot of the data negative, plus we know we have negatives in our dataset purposefully.
It’s time to drop the original target,
PredictedReadmissionRate, and center and scale everything
INCLUDING the target! Here it is for the training set:
readyTrain <- readyTrain %>%
## Drop the original target variable
select(-PredictedReadmissionRate) %>%
## Make anything that might still be a factor from encoding is back to number
mutate_if(is.factor, as.numeric) %>%
## Center and scale everything!
scale(center = TRUE, scale = TRUE) %>%
data.frame()
Do the same for your test data. Make sure to use
readyTest and overwrite it.
readyTest <- readyTest %>%
## Drop the original target variable
select(-PredictedReadmissionRate) %>%
## Make anything that might still be a factor from encoding is back to number
mutate_if(is.factor, as.numeric) %>%
## Center and scale everything!
scale(center = TRUE, scale = TRUE) %>%
data.frame()
Hint: Make sure to turn your centered, scaled matrix back into a dataframe! By default, it creates a matrix.
load(file = "FY2024_data_files/readyTrain.Rdata")
load(file = "FY2024_data_files/readyTest.Rdata")
As we finally, finally wrap-up pre-processing there is one final assessment we need to perform: multicollinearity. As I am certain you recall, multicollinearity is when we have high correlation between predictors (so much so, in fact, that it comes at the cost of any correlation with the target!). Recall from lecture that multicollinearity is when one or more of our predictors are moderately to highly correlated with each other. This can inflate our coefficients and thereby cause spurious associations through false positives (i.e., small p-values when we don’t actually have them!).
Here we will implement two common ways we assess multicollinearity: pairwise correlations and Variance Inflation Factors (VIFs).
Pairwise correlation is usually done as a Pearson or Spearman correlation. We will calculate a correlation matrix and look at the correlation of the variables within the training dataset to assess possible areas of multicollinearity. We can then visualize this matrix with a heatmap, similar to how we did with missingness, except now blue indicates a positive correlation, red indicates a negative correlation, and yellow indicates no (or zero) correlation. Darker colors indicate stronger correlations.
Figure 9. Correlation Heatmap of All Variables
Where do you see the highest potential for multicollinearity? Why do you say that?
I see the highest potential for multicollinearity in the “ComparedToNational_” variables and their associated source variables. They are inherently derived from other variables that are still included in the dataset, which will introduce multicollinearity between these pairs.
Which variable(s) seem to be most correlated with our target,
bc_PredictedReadmissionRate?
“ComparedToNational_ReturnDaysForPneumoniaPatients” seems to be the only standout with correlation to “bc_PredictedReadmissionRate,” though the survey results variables also have some positive correlation.
Another way we can assess multicollinearity is by fitting a multiple linear regression model and estimating the variance inflation factors, or VIF, which estimates how much the variance of an estimated regression coefficient increases if your predictors are correlated. Specifically, variance inflation can identify multicollinearity when the VIF is greater than ~4 and especially when it is above 10.
Predicted Readmission Rate as the
target.mod <- lm(bc_PredictedReadmissionRate ~ ., data = readyTrain)
vif()
function that is part of the car package.Note that I am not printing all of the VIFs, just the top 15 after sorting in descending order.
## Calculate the VIFs and put into a dataframe to print the table
vifResults <- vif(mod) %>% data.frame
## Change the column names of the dataframe
colnames(vifResults) <- c("VIF")
| VIF | |
|---|---|
| Overall.hospital.rating | 22.69 |
| Recommend.hospital | 14.77 |
| Nurse.communication | 8.35 |
| Care.transition | 8.02 |
| Staff.responsiveness | 5.26 |
| Communication.about.medicines | 4.59 |
| Doctor.communication | 4.19 |
| Discharge.information | 3.48 |
| Composite.patient.safety | 2.74 |
| Quietness | 2.35 |
| ComparedToNational_Composite.patient.safety | 1.87 |
| Death.rate.for.pneumonia.patients | 1.85 |
| Cleanliness | 1.79 |
| ComparedToNational_Death.rate.for.pneumonia.patients | 1.76 |
| Postoperative.respiratory.failure.rate | 1.56 |
Does multicollinearity seem to be an issue in this dataset and, if so, among which variables? Why do you come to this conclusion?
HINT 1: What was used to calculate
Overall Hospital Rating? Does this help us better
understand why the VIF is so large?
HINT 2: Is, for
example,Payment Category for pneumonia patients created
from Payment for pneumonia patients? [These are Medicare
payments]? Which one do you think we should keep amd why?
Multicollinearity is definitely an issue in this dataset. There are 7 variables with a VIF value of >4, suggesting possible multicollinearity issues in them, and two of those values are >10, more strongly suggesting an issue with multicollinearity. “Overall Hospital Rating” and “Recommend Hospital” are the biggest offenders with a VIF of 22.69 and 14.77, respectively.
Which variable(s) seem to be most highly correlated with our target? Is it a positive or negative correlation? Does it make sense?
summary(mod)
##
## Call:
## lm(formula = bc_PredictedReadmissionRate ~ ., data = readyTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2965 -0.4953 0.0022 0.5255 2.9863
##
## Coefficients:
## Estimate
## (Intercept) 3.755e-14
## MRSA.Bacteremia 2.764e-02
## Abdomen.CT.Use.of.Contrast.Material 4.584e-02
## Death.rate.for.pneumonia.patients -1.183e-01
## Postoperative.respiratory.failure.rate 7.028e-03
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 1.907e-02
## Composite.patient.safety -3.633e-02
## Medicare.spending.per.patient 1.084e-01
## Healthcare.workers.given.influenza.vaccination 4.240e-02
## Left.before.being.seen 6.526e-02
## Venous.Thromboembolism.Prophylaxis 1.013e-01
## Nurse.communication 3.114e-02
## Doctor.communication -5.668e-02
## Staff.responsiveness -7.418e-02
## Communication.about.medicines 1.655e-02
## Discharge.information -2.770e-02
## Care.transition 3.947e-02
## Cleanliness -5.227e-02
## Quietness -8.977e-02
## Overall.hospital.rating 1.851e-01
## Recommend.hospital -2.381e-01
## SurveyResponseRate -9.899e-03
## PaymentCategory.for.pneumonia.patients 1.451e-01
## ComparedToNational_MRSA.Bacteremia -9.310e-03
## ComparedToNational_Death.rate.for.pneumonia.patients 5.548e-02
## ComparedToNational_Composite.patient.safety -2.429e-02
## ComparedToNational_Hospital.return.days.for.pneumonia.patients -3.796e-01
## State 9.336e-02
## Std. Error
## (Intercept) 1.627e-02
## MRSA.Bacteremia 1.930e-02
## Abdomen.CT.Use.of.Contrast.Material 1.696e-02
## Death.rate.for.pneumonia.patients 2.215e-02
## Postoperative.respiratory.failure.rate 2.032e-02
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 1.807e-02
## Composite.patient.safety 2.693e-02
## Medicare.spending.per.patient 1.929e-02
## Healthcare.workers.given.influenza.vaccination 1.771e-02
## Left.before.being.seen 1.727e-02
## Venous.Thromboembolism.Prophylaxis 1.793e-02
## Nurse.communication 4.702e-02
## Doctor.communication 3.332e-02
## Staff.responsiveness 3.732e-02
## Communication.about.medicines 3.487e-02
## Discharge.information 3.037e-02
## Care.transition 4.606e-02
## Cleanliness 2.177e-02
## Quietness 2.493e-02
## Overall.hospital.rating 7.749e-02
## Recommend.hospital 6.252e-02
## SurveyResponseRate 1.940e-02
## PaymentCategory.for.pneumonia.patients 1.902e-02
## ComparedToNational_MRSA.Bacteremia 1.867e-02
## ComparedToNational_Death.rate.for.pneumonia.patients 2.156e-02
## ComparedToNational_Composite.patient.safety 2.223e-02
## ComparedToNational_Hospital.return.days.for.pneumonia.patients 1.782e-02
## State 1.925e-02
## t value Pr(>|t|)
## (Intercept) 0.000 1.000000
## MRSA.Bacteremia 1.432 0.152167
## Abdomen.CT.Use.of.Contrast.Material 2.703 0.006922
## Death.rate.for.pneumonia.patients -5.343 1.01e-07
## Postoperative.respiratory.failure.rate 0.346 0.729446
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 1.055 0.291621
## Composite.patient.safety -1.349 0.177355
## Medicare.spending.per.patient 5.618 2.17e-08
## Healthcare.workers.given.influenza.vaccination 2.394 0.016731
## Left.before.being.seen 3.779 0.000162
## Venous.Thromboembolism.Prophylaxis 5.648 1.83e-08
## Nurse.communication 0.662 0.507853
## Doctor.communication -1.701 0.089070
## Staff.responsiveness -1.988 0.046992
## Communication.about.medicines 0.475 0.635002
## Discharge.information -0.912 0.361830
## Care.transition 0.857 0.391662
## Cleanliness -2.401 0.016434
## Quietness -3.601 0.000324
## Overall.hospital.rating 2.388 0.017028
## Recommend.hospital -3.808 0.000144
## SurveyResponseRate -0.510 0.609916
## PaymentCategory.for.pneumonia.patients 7.630 3.47e-14
## ComparedToNational_MRSA.Bacteremia -0.499 0.618124
## ComparedToNational_Death.rate.for.pneumonia.patients 2.574 0.010121
## ComparedToNational_Composite.patient.safety -1.092 0.274778
## ComparedToNational_Hospital.return.days.for.pneumonia.patients -21.300 < 2e-16
## State 4.851 1.32e-06
##
## (Intercept)
## MRSA.Bacteremia
## Abdomen.CT.Use.of.Contrast.Material **
## Death.rate.for.pneumonia.patients ***
## Postoperative.respiratory.failure.rate
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate
## Composite.patient.safety
## Medicare.spending.per.patient ***
## Healthcare.workers.given.influenza.vaccination *
## Left.before.being.seen ***
## Venous.Thromboembolism.Prophylaxis ***
## Nurse.communication
## Doctor.communication .
## Staff.responsiveness *
## Communication.about.medicines
## Discharge.information
## Care.transition
## Cleanliness *
## Quietness ***
## Overall.hospital.rating *
## Recommend.hospital ***
## SurveyResponseRate
## PaymentCategory.for.pneumonia.patients ***
## ComparedToNational_MRSA.Bacteremia
## ComparedToNational_Death.rate.for.pneumonia.patients *
## ComparedToNational_Composite.patient.safety
## ComparedToNational_Hospital.return.days.for.pneumonia.patients ***
## State ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7647 on 2182 degrees of freedom
## Multiple R-squared: 0.4224, Adjusted R-squared: 0.4153
## F-statistic: 59.1 on 27 and 2182 DF, p-value: < 2.2e-16
“ComparedToNational_Hospital.return.days.for.pneumonia.patients” and “Recommend.hospital” have the strongest coefficients, suggesting the strongest correlation with “bc_PredictedReadmissionRate.”
Even though we plan to proceed with an elastic net (which is much more robust to multicollinearity), some of these VIFs are too high for comfort. We also want to pause and ask - were any of the variables used to make other variables? (The answer is “yes”!). Those are excellent candidates to scrub too; we want to use our domain knowledge and what our stakeholders are asking for to help us decide which ones to keep.
Because the goal is a deployable predictive model, we likely want
something more granular than most of the ComparedToNational
features can give us, with the exception of our key
predictor,
ComparedToNational_Hospital return days for pneumonia patients.
I will remind you of what we explored in Section 2 with this variable;
not only that, we can see how how highly correlated it is with
our target!.
## Drop those with VIF over 5
vifOver5 <- vifResults %>% filter(VIF >=5) %>% rownames()
print(vifOver5)
## [1] "Nurse.communication" "Staff.responsiveness"
## [3] "Care.transition" "Overall.hospital.rating"
## [5] "Recommend.hospital"
## Now drop them
readyTrain <- readyTrain %>% select(-any_of(vifOver5))
readyTest <- readyTest %>% select(-any_of(vifOver5))
ComparedToNational and
PaymentCategory features except
ComparedToNational_Hospital return days for pneumonia patientsreadyTrain <- readyTrain %>%
## Rename the column we want to keep
rename(Hospital.return.days.for.pneumonia.patients = ComparedToNational_Hospital.return.days.for.pneumonia.patients) %>%
## Drop all the other columns
select(-contains("Compared"), -contains("PaymentCategory"), -SurveyResponseRate)
readyTest <- readyTest %>%
## Rename the column we want to keep
rename(Hospital.return.days.for.pneumonia.patients = ComparedToNational_Hospital.return.days.for.pneumonia.patients) %>%
## Drop all the other columns
select(-contains("Compared"))
Predictive modeling is our ultimate goal, but let’s pretend our stakeholders have expressed a desire to better understand how the hospitals relate to one another and whether there are any general patterns we should be interested in for enhanced interpretation. After all, a prediction is one thing, but understanding the why is still usually important!
Thus, we have opted to undertake a Segmentation Analysis to see if there is any underlying segments, or clusters, of hospitals based on the data we have. We will use the unsupervised methods of Principal Component Analysis (PCA) and \(k\)-means clustering do this; it will help us to determine if there are broader classifications of our hospitals that we need to report back to the advocacy group.
First, we apply a PCA on the data and display the principal components. Note that we have the same number of principal components as we do variables in the data set! The cumulative proportion of variance adds to 100% by our last principal component; but we’re really more interested in the portion of variance explained by EACH of the principal components. This can help us figure out just how many PCs are important for explaining the most variance. For example, if we were interested in feature reduction, we could use the top PCs that explain, say, some portion of the variance that we wish to retain.
At the 11th principal component.
pca <- prcomp(readyTrain)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0083 1.3531 1.23770 1.15277 1.11902 0.98837 0.95617
## Proportion of Variance 0.2241 0.1017 0.08511 0.07383 0.06957 0.05427 0.05079
## Cumulative Proportion 0.2241 0.3258 0.41089 0.48472 0.55429 0.60856 0.65935
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.93867 0.92718 0.8807 0.85563 0.81030 0.73429 0.68600
## Proportion of Variance 0.04895 0.04776 0.0431 0.04067 0.03648 0.02995 0.02614
## Cumulative Proportion 0.70830 0.75606 0.7992 0.83983 0.87631 0.90626 0.93240
## PC15 PC16 PC17 PC18
## Standard deviation 0.65207 0.58828 0.50527 0.43607
## Proportion of Variance 0.02362 0.01923 0.01418 0.01056
## Cumulative Proportion 0.95603 0.97525 0.98944 1.00000
Principal component 1 captures only 22.41% of the variance. This is a relatively poor amount of variance to capture. In my experience, generally you want a higher percentage for PC1, like 35 - 50%, and a cumulative of 80+% when combining PC1, PC2, and sometimes PC3, in order to consider it an effective reduction in dimensionality. The final PC captures only 1.056% of the variance, which is not atypical, though 18 principal components is quite a few.
Figure 10. Scree Plot of First 15 Principal Components
How many PCs appear optimal to you based on the scree plot?
It looks like 5 PCs is a decent cutoff, as that is the last stark dropoff before flattening out, though the overall plot is relatively gradual so the cutoff could be earlier at 3 PCs, though that doesn’t capture much variance at all. Though holding out for 80+% variance until the 11th PC is not feasible if you actually want to reduce dimensionality.
I think it comes from geology? A scree is a cliff face that drops off quickly then turns into a more gradual decline as the smaller rocks tumble down.
Next, let’s explore how much the variables are contributing (this is based on the weight of their rotations). In PCA, we are interested in the loadings, i.e., the linear combination weights (coefficients) whereby unit-scaled components define or “load” a variable. Loadings help us interpret principal components.
You may have noticed we used the prcomp() function
above; prcomp() will return the loadings in the variable
$rotation, which contains a matrix of variable loadings.
These loadings have been determined from the eigenvectors, which without
getting into what that is (because we’d have to take a departure into
linear algebra), suffice it to say that they are how we are measuring
the direction and magnitude during each subsequent rotation of the
principal components as we measure the variance explained.
Additionally, we are going to color this by our grouping variable and
key predictor in the dataset,
Hospital.return.days.for.pneumonia.patients. Recall that
when we explored this earlier this variable is the performance of each
hospital in terms of its pneumonia-related return-to-hospital days
relative to the national average. In other words, hospitals that were
“Worse than average” had readmitted pneumonia patients with
longer-than-average stays.
Figure 11. PCA Biplot Showing the Top Feature that Contributes to Explained Variance
This biplot (Figure above) shows that, on principal components 1 and
2 (the PCs that account for the most variance in the
training data) there is subtle but noticeable grouping of hospitals in
terms of how long their readmitted pneumonia patients stayed in the
hospital. I have chosen to display the #1 feature that
contributes to that variance,
Communication.about.medicines.
Each quadrant on a PCA biplot corresponds to a different directional influence of the original variables. E.g., because it is in the top-right quadrant I where both PC1 and PC2 are positive, this suggests that communication about medicines are positively associated with variables that point into the same quadrant. Since it seems to correspond with where the “better” hospitals are, this suggests that these hospitals are made somewhat distinctly “better” at readmitted hospital stay lengths because of correlations with communication about medicines. (NOTE: If you look at the 2025 data, the bipolot is effectively inverted but the conclusions seem to be the same in 2025 as well).
A general rule of thumb is:
Play around with the graphical parameters in the biplot, especially
those I’ve marked with comments. Try adding, for example, more variables
to display rather than just the top contributor. NOTE
that if you find variables in the same quadrant together, this means
that they are correlated. What variable(s) can you find that
contrast with
Communication.about.medicines?
There are a lot that seem to be correlated with “Communication.about.medicines.” The two variables that most directly contrast with “Communication.about.medicines” (i.e., have a strong negative loading on PC1 and a slight positive loading on PC2) are “bc_PredictedReadmissionRate,” and “MedicareSpendingPerPatient.” The contrast with “bc_PredictedReadmissionRate” may suggest that better communication about medicines leads to reduced readmission rates for pneumonia patients (and/or vice versa).
Now, look at the contributions to PCs 1 and 2 (Figure Below) from the variables in the training dataset. Anything above the dotted red-line contributes significantly to the overall explanation of variance, and anything below the dotted line does not. Does anything about this plot stand out to you?
HINT 1: What do the top contributing variables all have in common? In other words, did they all come from the same data source?
HINT 2: Where does the target,
PredictedReasmissionRate, fall among the significant
variables?
Many of these top contributors come from the same survey response data, and generally point to patients having a ‘good experience’ or not. There are surprisingly few concrete, measureable metrics that contribute significantly, like MRSA, veinous thromboembolism prophylaxis, etc. “PreductedReadmissionsRate” falls right on the threshold of significant and insignificant.
Figure 12. Variables that Significantly Explain Variance.
\(k\)-means is a centroid-based clustering algorithm, where a “centroid” is the geometric center of an object often measured by Euclidean distance. In the algorithm, the distance between each data point and a centroid is calculated by randomly grabbing data; these distances are then used to assign the data point to a cluster. The goal is to identify the optimal \(k\) number of groups in a dataset by minimizing the distance of each datapoint to a respective centroid.
Although multiple methods exist to determine the optimal number of clusters, one commonly employed is heuristic, i.e., it isn’t really computational. You will either choose the optimal clusters based on a set of options of \(k\) graphically or you can once again employ the “elbow” method again, as you did with PCA. Let’s start by visually inspecting the effect of different \(k\) clusters on PC1 and PC2 of the dataset. I have chosen to test \(k\) groups of 2 through 9.
Which number of clusters, \(k\), do you think has the best explanatory power? Why? Is it is hard to tell?
By looking at the Biplots, it seems that K = 2 or K = 3 has the best explanatory power. I think that K = 3 is slightly better than K = 2, there are still another subgroup present and it has a good balance of simplicity and very strong explanitiory power. In our case, we dont need to many subgroups from finding pnemonia readmission rate. K = 3 also avoids overfitting and shows the structure of the data well.
Next, let’s employ the ‘elbow’ method again, this time to look at when the within sum of squares drops off rather than the percent variance, as was done in PCA.
Figure 14. WSS Scree plot for k-means
What is the within sum of squares and what exactly is it measuring here? By the elbow method, which \(k\) is the optimal number of clusters? Did it agree with your choice from the other heuristic method?
The ‘within sum of square’ is a measure of the total distance between each data point and the centroid of the cluster it belongs to (cluster compactness). In measuring terms, it is the sum of squared Euclidean distances from each point to its cluster’s center. In the above plot, it is used to apply the elbow method where we are trying to find the point where the decrease in WSS begins to level off (elbow), which suggests that adding more clustters beyond this point would yield diminishing returns in explaining variance. By looking at the plot, the WSS drops from k = 1 to k = 2 or 3, after which the rate of decrease slows down. The “elbow” appears most noticeably at k = 2. This means that adding clusters after this point doesnt reduce the WSS, making K = 2 the optimal number of clusters. Yes, it did agree with my choice from the other heuristic method.
Lastly, we will apply the silhouette method, which measures the average silhouette width, a combination of (1) how similar any given point is to its own cluster (also called cohesion) and (2) how different that same point is from other clusters (also called separation). Silhouette scores can range from \(-1 \rightarrow +1\), with larger values indicating better clustering (i.e., more cohesive clusters that are separated from each other). Thus, a silhouette score of 1 indicates perfect separation and cohesion whereas a silhouette of 0 indicates overlapping clusters.
What we can do is iteratively test our possible \(k\) values of 2 to 9 and a silhouette score will be calculated for each \(k\) tested. The highest silhouette score indicates the optimal \(k\).
Figure 14. Silhouette Plot for k-means
Are you surprised at the optimal \(k\) from the silhouette method? Does it seem to agree with what you’ve seen with the WSS elbow method? (Hint: agreement generally suggests evidence that you have found the optimal \(k\)).
Looking at the silhouette method, the optimal K = 2. I am not suprised by this becuase the results agree with what I was seeing with the WSS elbow method and visual inspection. Looking at our goal, simple with not too much overfitting seems good in our case of looking at pneumonia readmission rates, since we are looking at good and not so good hospitals, and we can get this with K = 2.
Lastly, let’s extract the optimal \(k\) from the silhouette method.
## Extract optimal k from silhouette method; first get the cluster data
clusterData <- fviz_nbclust(readyTrain, kmeans, method = "silhouette")$data
## Then extract the maximum y
k <- as.numeric(clusterData$clusters[which.max(clusterData$y)])
k
## [1] 2
As we said before, the goal of segmentation analysis is to broadly cluster our hospitals by their features to help us characterize them.
Let’s explore the clusters - segments - of the hospitals based on
their average values from the original dataset (we
start from imputedTrain, which hadn’t yet been transformed,
centered, or scaled, BUT the multicollinear variables also haven’t been
dropped either). Your task is to add comments to this code chunk.
## This runs k-means clustering on the transformed dataset
kmeansResult <- kmeans(readyTrain,
centers = k, ## sets the number of clusters
nstart = 50,
iter.max = 1000,
algorithm = "MacQueen")
## Creates a new version of the original dataset with clusters attached
kmeansTrain <- imputedTrain %>%
## Rename the column we want to keep
rename(Pneumonia.Hospital.Return.Days = `ComparedToNational_Hospital return days for pneumonia patients`) %>%
## Drops multicollinear variables, comparison scores, payment categories, and survey response rate
select(-any_of(vifOver5), -contains("Compared"), -contains("PaymentCategory"), -SurveyResponseRate) %>%
## Adds cluster assignment as a new variable and makes pneumonia return days column numeric
mutate(Cluster = kmeansResult$cluster,
Pneumonia.Hospital.Return.Days = as.numeric(if_else(Pneumonia.Hospital.Return.Days == "1", 1,
if_else(Pneumonia.Hospital.Return.Days == "0", 0, -1)))) %>%
## Cleans column names to lowercase with words separated by underscores for easier handling
clean_names()
## Using % contribution graph, the order of the top 10:
pcaImportantVars <- factor(c("Communication.about.medicines",
"Doctor.communication",
"Composite.patient.safety",
"Discharge.information",
"Quietness",
"Postoperative.respiratory.failure.rate",
"Cleanliness",
"Periop.Rate.Of.Embolism.Or.Thrombosis",
"Predicted.Readmission.Rate",
"Pneumonia.Hospital.Return.Days"))
## Converts variable names to lowercase for consistent formatting
pcaImportantVars <- tolower(pcaImportantVars)
## Replaces periods with underscores
pcaImportantVars <- gsub("\\.", "_", pcaImportantVars) %>%
## Converts to title case for better readability in plots
to_any_case("title")
## Creates boxplots to see important variables across clusters
kmeansTrain %>%
## Renames long variable names to readable format
rename(Predicted.Readmission.Rate = predicted_readmission_rate,
Periop.Rate.Of.Embolism.Or.Thrombosis = perioperative_pulmonary_embolism_or_deep_vein_thrombosis_rate) %>%
## Converts column names to title case, clean presentation for plots
clean_names(case = "title") %>%
## Selects only the cluster column and the top 10 important variables
select(Cluster, any_of(pcaImportantVars)) %>%
## Reshapes the data into long format
pivot_longer(-1, names_to = "Measure", values_to = "Value") %>%
## Sets order of the measures so they display consistently in plot
# filter(Measure %in% pcaImportantVars) %>%
mutate(Measure = factor(Measure, levels = to_any_case(c("Communication.about.medicines",
"Doctor.communication",
"Composite.patient.safety",
"Discharge.information",
"Quietness",
"Postoperative.respiratory.failure.rate",
"Cleanliness",
"Periop.Rate.Of.Embolism.Or.Thrombosis",
"Predicted.Readmission.Rate",
"Pneumonia.Hospital.Return.Days"), case="title"))) %>%
## Create boxplots to compare distributions of each measure across clusters
ggplot(aes(x = Cluster, y = Value, fill = as.factor(Cluster))) +
geom_boxplot(alpha=0.75) +
scale_fill_manual(values = c(skittles[1], skittles[6])) +
labs(title = "Cluster Means",
x = "",
fill = "Cluster") +
theme_classic() +
theme(legend.position = "bottom",
strip.text = element_text(size=8),
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
title = element_text(size = 12, color = "maroon")) +
## Separates the plots by measure, adjusting scales individually and arranging them in 2 columns
facet_wrap(~ Measure,
scales = "free_y",
ncol = 2)
Figure 16. Boxplot Comparisons of Important Variables by Cluster
## Making a summary stats table for each cluster with mean and median of each important variable
kmeansTrain %>%
## Renames the key variables to match the clean names
rename(Predicted.Readmission.Rate = predicted_readmission_rate,
Periop.Rate.Of.Embolism.Or.Thrombosis = perioperative_pulmonary_embolism_or_deep_vein_thrombosis_rate) %>%
## Cleans names to title case for better display
clean_names(case = "title") %>%
## Selects only cluster and important variables
select(Cluster, any_of(pcaImportantVars)) %>%
## Creates a summary table grouped by cluster
tbl_summary(by = "Cluster",
statistic = list(all_continuous() ~ c("{mean} ({median})"))) %>%
modify_header(label ~ "Variable",
all_stat_cols() ~ "**Cluster {level}**") %>%
modify_caption("Table 8. Summary Statistics of Important Cluster Variables")
| Variable | Cluster 11 | Cluster 21 |
|---|---|---|
| Communication About Medicines | 76.5 (76.0) | 70.9 (71.0) |
| Doctor Communication | 91.03 (91.00) | 87.68 (88.00) |
| Composite Patient Safety | 0.98 (0.96) | 1.06 (1.01) |
| Discharge Information | 87.0 (87.0) | 82.3 (83.0) |
| Quietness | 83.2 (83.4) | 77.1 (77.0) |
| Postoperative Respiratory Failure Rate | 8.69 (8.53) | 9.93 (9.25) |
| Cleanliness | 86.4 (86.8) | 82.1 (82.6) |
| Periop Rate of Embolism or Thrombosis | 3.55 (3.52) | 3.73 (3.59) |
| Predicted Readmission Rate | 15.77 (15.67) | 17.57 (17.38) |
| Pneumonia Hospital Return Days | ||
| -1 | 188 (14%) | 430 (50%) |
| 0 | 976 (72%) | 408 (47%) |
| 1 | 185 (14%) | 23 (2.7%) |
| 1 Mean (Median); n (%) | ||
HINT: If these are back to the original
values, this is more interpretable for our stakeholder. We will now read
this just like a typical boxplot; for example, median (the middle line)
Doctor Communication seems higher in Cluster
1 than it does in Cluster 2 with only some
overlap in the distributions (the boxes and whiskers don’t really
overlap each other). The table confirms this; the median
Doctor Communication score is 91% in Cluster 1 but only 88%
in Cluster 2.
Notice, for example, that Cluster 1 hospitals seem to include the hospitals that had the lowest pneumonia-related readmission rates, with the average rate ~2% higher in Cluster 2 hospitals. Thus, for our stakeholder, we might decide to rename Cluster 1 as “Highest Performing Hospitals” or something along those lines. Notice also that this cluster had more below national average return hospital days (a “1”). This means that they had more readmitted pneumonia patients who spent less time in the hospital upon readmission than the national average.
Look at how else you might broadly characterize - or segment - these hospitals. Your job is to make a table for your client, summarizing each of the two clusters with FOUR primary criteria. Make sure to give each cluster an informative name as I did; you’re welcome to rename Cluster 1 if you think of a catchier name than I chose!
After you come up with names of the two clusters, you will then come up with at least FOUR characteristics per cluster summarizing what you see in either the boxplots or table. You may choose to focus on different attributes for different clusters. Fill in the code for the table below. I start you out with two examples for Cluster 1, but feel free to add more!
| Top Defining Attributes | Description |
|---|---|
| 1: Highest Performing Hospitals | |
| Predicted Pneumonia-related hospital readmissions | These hospitals have a lower readmission rate, suggesting that hospitals in this cluster may be better at diagnosing and treating pneumonia |
| Return hospital days due to pneumonia | Readmitted patients spend below the national average number of days in hospital, suggesting these hospitals are able to more quickly treat and stabilize pnuemonia patients |
| Doctor communication | These hospitals have higher provider to patient communication, suggesting that doctors at these hospitals might be better at underdtanding the needs of their patients therefore are able to provid better care |
| Cleanliness and Quietness | These hospitals scored higher on cleanliness and quietness, suggesting that they are indicators of a comfortable, sanitary, and overall controlled environment |
| 2: Struggling Hospitals | |
| Predicted Pneumonia-related hospital readmissions | These hospitals have a higher readmission rate, suggesting that they are less effective at pneumonia treatment and or discharge planning |
| Return hospital days due to pneumonia | These hospitals have a greater proportion of patients that spend average or more days in the hospital on readmission, suggesting that they might have more severe complications or less efficient care |
| Communication about medicines | These hospitals have less effective communication about medications, which can lead to confusion and inproper treatment |
| Patient safety indicators | These hospitals have higher Composite safety scores, suggesting that there are more safety related incidents and or complications |
We don’t technically have to stop our segmentation analysis here, but we will so that we can focus on other analyses. One thing we might choose to do in the future, for example, is to choose a classification machine learning algorithm to test the predictions based on these clusters.
Make a recommendation for our client for 1-2 machine learning analyses your team would choose to use to test whether these clusters can be used for robust predictions of hospital performance for pneumonia patients. How would you know whether or not your predictions were robust? What would you look for or compare?
In order to test whether these clusters can be used to make robust predictions of hospital perfomance for pneumonia patients, I would recommend using the following two machine learning classification models: 1. Random Forest Classifier: which can create models that handle nonlinear relationships and high dimensional data, rank feature importance, and help understand which hospital characteristics best differentiate the clusters. 2. Logistic Regression: which acts as a good basline to compare against more complex algorithms, it provides a simple model, and it can be used to identify linear relationships between hospital attributes and clusters. To evaluate robustness of predictions, we would look at Accuracys, Precisions & Recalls, F1 Scores, Confusion Matrixs, do Cross-validation, and looke at the AUC-ROC Curves. We can compare the models by looking at the performance of the Random Forest vs Logistic Regression. We can look at the model stability across cross-validation folds, look at feature importance (for Random Forest) or coefficients (for Logistic Regression) to make sure that the meaningful variables are driving predictions. If both models perform well and generalize across folds, we can be confident the clusters are predictive and actionable!
We are going to perform two supervised learning methods: an Ordinary Least Squares (OLS) regression followed by an Elastic Net.
Earlier in Section 4 when we assessed multicollinearity, we fit an OLS regression model so that we could calculate variance inflation factors (VIFs). However, because OLS regression is so incredibly sensitive to multicollinearity, we should re-run it now that we’ve removed some of the most redundant variables.
Let’s model that now, using the readyTrain dataset and
plot some diagnostic plots. to assess the key assumptions of an OLS
linear regression. The four plots are diagnostic plots for the primary
assumptions of a linear regression:
Linearity between the outcome and the predictors
Normally distributed residuals
Homosecadasticity (equality of variances) in the residuals
No high leverage/influence points (no significant outliers)
What is your general assessment of our OLS assumptions? Met or unmet? Why?
My overall assessment of the OLS assumptions are that most OLS assumptions are reasonably met. Linearity is mostly met, with some minor nonlinear patterns. It looks like there is a slight curve upward. Normality is also mostly met, though a few outliers deviate from perfect normality. Homoscedasticity is met, with no clear funnel shape. No high leverage points is partially unmet, due to 3 influential observations.
Note: If you are not entirely sure about the Q-Q plot, you can run a Shapiro-Wilk test on the residuals using the following code:
shapiro.test(residuals(mod))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod)
## W = 0.99862, p-value = 0.06735
The most important OLS assumption for Elastic Net is linearity of the relationship between predictors and the response. The other, stricter OLS assumptions don’t matter as much because of regularization.
Do you find support for linearity here? Why or why not?
Yes, there is sufficient support for linearity. The Residuals vs Fitted plot shows residuals mainly centered around zero with no strong patterns, suggesting an overall linear relationship. While there are slight deviations, they are minor and not severe enough.
## [1] 0.39427
It is a surprisingly high (although it isn’t stellar). You may need
to do some outside research, but explain in a sentence or two why,
generally speaking, we use adjusted \(R^2\) and not the unadjusted one
(confusingly labeled as multiple r-squared in the
summary() output below).
We use adjusted R^2 instead of the unadjusted R^2 because adjusted R^2 accounts for the number of predictors in the model and penalizes for adding variables that do not improve model performance. Unlike the regular R^2 , which always increases when more variables are added, adjusted R^2 can decrease if the new predictors do not contribute meaningfully to explaining the variance in the outcome. This makes it a more reliable metric for comparing models with different numbers of predictors (James et al., Chapter 3, Section 3.1.2 – “Assessing the Accuracy of the Model.”).
Use the step() function to do a backward regression to
find our most parsimonious model. Stepwise regression
drops terms that are not significant or important, thereby only keeping
the features that contribute to the overall explanation of variance in
predicted readmission rate.
## Perform a backward, stepwise regression to find the most parsimonious model
bestMod <- step(mod,
direction = "backward",
trace = FALSE)
Lastly, let’s take a peek at the results. We will use a package
called stargazer to help us visualize the results in a
nicely organized way.
## Print a table using stargazer
stargazer(bestMod,
type = "html",
title = "Table 10. Parsimonious OLS Regression Results.")
| Dependent variable: | |
| bc_PredictedReadmissionRate | |
| MRSA.Bacteremia | 0.039** |
| (0.017) | |
| Abdomen.CT.Use.of.Contrast.Material | 0.045*** |
| (0.017) | |
| Death.rate.for.pneumonia.patients | -0.154*** |
| (0.017) | |
| Medicare.spending.per.patient | 0.167*** |
| (0.018) | |
| Healthcare.workers.given.influenza.vaccination | 0.044** |
| (0.018) | |
| Left.before.being.seen | 0.062*** |
| (0.017) | |
| Venous.Thromboembolism.Prophylaxis | 0.101*** |
| (0.018) | |
| Doctor.communication | -0.074*** |
| (0.027) | |
| Discharge.information | -0.053** |
| (0.024) | |
| Cleanliness | -0.053*** |
| (0.020) | |
| Quietness | -0.077*** |
| (0.022) | |
| Hospital.return.days.for.pneumonia.patients | -0.396*** |
| (0.018) | |
| State | 0.110*** |
| (0.019) | |
| Constant | 0.000 |
| (0.017) | |
| Observations | 2,210 |
| R2 | 0.398 |
| Adjusted R2 | 0.395 |
| Residual Std. Error | 0.778 (df = 2196) |
| F Statistic | 111.824*** (df = 13; 2196) |
| Note: | p<0.1; p<0.05; p<0.01 |
Lastly, let’s assess how good a predictive model our OLS regression
model is. We will use the predict() function to first fit
our most parsimonious model bestMod to the test data:
## Make the predictions
predictions <- predict(bestMod, newdata = readyTest)
The Root Square Mean Error (\(RMSE\)) is the square root of the average of the squared differences between predicted and actual value, such that \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n\cdot(y_i-\hat{y}_i)^2}\) where \(y_i\) is each observed value, \(\hat{y}_i\) is each corresponding predicted value, and \(n\) is the number of observations. NOTE: Recall that a residual is \(y_i-\hat{y}_i\) - the distance between every point and its predicted value! It’s also important to realize that the \(RMSE\) gives more weight to large errors because of squaring. Thus, it’s sensitive to outliers.
Can you interpret what the RMSE tells your client here? Why or why not? What about the \(R^2\)?
Hint 1: Remember that the outcome here is Box-Cox transformed. Would you need to undo the transformation to be able to assess what the RMSE (or \(R^2\)) is actually telling us? Recall that the Box-Cox \(\hat{y} = \frac{1}{y^\lambda}\) where our \(\lambda = -0.3\) AND we also centered and scaled these data.
Hint 2: Is it practical to undo both transformations? How does that hinder our ability to make an interpretation for our client?
No, we cannot clearly interpret the RMSE or R^2 in practical terms for the client without undoing the Box-Cox transformation, centering, and scaling of the outcome variable. The outcome variable predicted readmission rate was Box-Cox transformed with λ = −0.3 and then standardized, so the RMSE reflects error on a transformed scale rather than in the original percentage units, making it uninterpretable. Also, the R^2 value of 0.398 indicates that the model explains about 40% of the variance in the transformed and standardized readmission rate, but this does not directly reflect how much variance is explained in the original, real-world percentage scale. However, we can use them to compare model performance internally across the different models, assuming the outcome remains consistently transformed.
Lastly, let’s graph the effect predictive ability of our OLS model by
comparing the relationship between the true values of
bc_PredictedReadmissionRate from the readyTest
dataset to the predictions we made using the model on the testing
data.
Figure 18. Predictions Plot for OLS Regression
NOTE that, while there is an obvious linear trend, it is weak. There is a lot of error around the predictions from the OLS regression!
Perhaps the seemingly low \(R^2\) is
because we still have too much multicollinearity or overfitting. So, as
we discussed in lecture, we can address this using regularized
regressions, such as the Elastic Net, which is more
robust to multicollinearity by assessing penalties for non-zero
variance. Typically, we would likely conduct a Ridge and/or LASSO before
running an Elastic Net, but since we already know that (1) we have a
fairly high amount of multicollinearity in our original
readyTrain dataset and (2) we know we want to be able to
perform feature reduction and figure out which features are important
predictors of hospital readmissions, we’re going to jump straight into
Elastic Net, which performs both.
In general, regularization is a technique that applies a penalty term to a cost function of a machine learning model, like the OLS regression. The reason is to discourage overfitting. This penalty term constrains the model’s coefficients, which limits how flexible they are during training. Why?
The more flexible the coefficients, the less likely they will “learn” new data, i.e., the more generalizable they are! Thus, by applying this penalty (constraint), it improves the model’s performance on “unseen” or new data.
As mentioned in lecture, Elastic Net regression is a general regularization technique that combines the regularization techniques of Ridge and LASSO to help us perform both feature selection (i.e., significant coefficients) and feature reduction (importance). The elastic net can even help us with the \(P >> n\) problem! Elastic net has two regularization terms, called L1 (or the Lasso regularization term) and L2 (the Ridge term).
L1: makes some of the coefficients zero, thereby selecting only the most important features (shrinks them to zero). It is often said that this encourages sparsity in the model.
L2: reduces the coefficients to small but non-zero values, thereby reducing the coefficients of the unimportant features (which we can use to determine significance). It does this by adding a penalty term to the cost function of the model, which is proportional to the coefficient squared. This allows us to retain all the features in the model but reduces the overall impact of the non-significant or less-important ones.
Thus, by combining these techniques together, the Elastic Net is balancing feature selection with feature reduction! If you’re interested in reading the original paper on Elastic Net by Zou and Hastie (2005), you can find a free PDF of the paper here.
The elastic net can be run using the glmnet package
(which we are accessing via caret), and can be
cross-validated as random forest and SVM from our last project could be.
This is a big improvement over OLS regression, which cannot be tuned.
Additionally, the elastic net has two hyperparameters
that we can tune, \(\alpha\) and \(\lambda\) (not to be confused with the
\(\lambda\) from Box-Cox transformation
- Greek letters are abundant in statistics! Why do you think I named our
course teams as I did?!).
\(\alpha\) is the strength of the regularization, representing the balance between L1 and L2 regularization. Larger \(\alpha\) results in L1 regularization (feature selection / Ridge) whereas smaller \(\alpha\) results in more L2 regularization (higher shrinkage / feature reduction / Lasso). When \(\alpha=0\) the regression is equivalent to OLS regression! \(\alpha\) is also referred to as the mixing percent of L1 and L2 regularization.
\(\lambda\) is the shrinkage parameter, such that when \(\lambda = 0\) no shrinkage is performed and is equivalent to an OLS regression. As \(\lambda\) increases, the coefficients are increasingly shrunk and thus the more features will have coefficients shrunk to zero, regardless of the value of \(\alpha\).
trainControl() function in
caret.NOTE that I am performing a 5-fold repeated cross-validation (CV) with 5 iterations.
ctrl <- trainControl(method = "repeatedcv", ## Do repeated CV
number = 5, ## Number of k-folds
repeats = 5, ## Number of repeats
search = "grid", ## Grid search (vs. random)
verboseIter = FALSE) ## Don't show me all the results
train()
function in caret, specifying that we want to perform an
elastic net, which will tune the model and perform CV per the
specifications in the trainControl(). NOTE
that I am allowing the values of \(\alpha\) to range from \(0 \rightarrow 1\) and allowing \(\lambda\) to range from \(0 \rightarrow 5\).## Create a search grid that allows alpha to range from 0 --> and
## allows lambda to range from 0 --> 5
searchGrid <- expand.grid(.alpha = seq(0, 1, length.out = 10),
.lambda = seq(0, 5, length.out = 15))
## Fit the elastic net using the tuning grid and CV scheme laid out
elasticMod <- train(bc_PredictedReadmissionRate ~ .,
data = readyTrain,
method = "glmnet",
tuneGrid = searchGrid,
trControl = ctrl)
Explain what we are doing here. You may want to look up the
parameters of the trainControl() and train()
functions. Make sure to address what type of CV and hyperparameter
search is being performed, as well as if both the \(\alpha\) and \(\lambda\) hyperparameters of elastic net
are being tuned.
Hint 1: Why are we doing cross-validation, and specifically why did I choose the one I did?
Hint 2: What does \(\lambda = 0\) mean? How about \(\alpha = 0\)?
We are using repeated 5-fold cross-validation with grid search to train an Elastic Net regression model, simultaneously tuning both: \(\alpha\), which balances L1 (LASSO) and L2 (Ridge) penalties, and \(\lambda\), which controls the overall strength of the regularization. This approach helps mitigate overfitting, addresses multicollinearity, and selects the most predictive features, making the final model more robust and interpretable.
| alpha | lambda | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.33 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.56 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.44 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.22 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.67 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.78 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.11 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 1.00 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.89 | 0.00 | 0.78 | 0.39 | 0.62 | 0.02 | 0.03 | 0.02 |
| 0.00 | 0.36 | 0.79 | 0.39 | 0.63 | 0.02 | 0.03 | 0.02 |
| 0.11 | 0.36 | 0.80 | 0.38 | 0.63 | 0.02 | 0.03 | 0.02 |
| 0.00 | 0.71 | 0.80 | 0.38 | 0.64 | 0.02 | 0.03 | 0.02 |
| 0.00 | 1.07 | 0.82 | 0.37 | 0.65 | 0.02 | 0.03 | 0.02 |
| 0.00 | 1.43 | 0.83 | 0.36 | 0.66 | 0.02 | 0.04 | 0.02 |
NOTE: Folds that fail to converge will have an \(R^2\) = NA.
\(\alpha\):
## [1] 0
\(\lambda\):
## [1] 0
Check the values of \(\alpha\) and \(\lambda\) here. What do they suggest about the optimal regularization that is being fit here? (Recall that when both \(\alpha\) and \(\lambda\) are zero, it’s an OLS regression!)
With the values of \(\alpha\) and \(\lambda\) both being 0, these values suggest that no regularization is being applied at all. The Elastic Net tuning process confirms that OLS regression performs just as well or better than regularized models in this case under the current transformations and data conditions. So, therefore regularization wasn’t necessary but testing for it can still be valuable.
Let’s also plot the results of the tuning search. Can you figure out the type of tuning search you did? (Hint: the visual pattern should confirm your answer from earlier!) The diamond denotes the best combination of hyperparameters!
results <- elasticMod$results %>%
data.frame()
best <- results[as.numeric(rownames(elasticMod$bestTune)), ]
Figure 19. Hyperparamater Grid Search for Elastic Net
Important features can also be extracted, using the
varImp() function from caret.
important <- varImp(elasticMod)$importance
Figure 20. Variable Importance from Best Elastic Net Model
Which of the features were most important? How does this compare to the most important features out of OLS regression? Make sure you’re looking at the absolute value of the beta coefficients to assess importance in OLS regression.
The most important features from the Elastic Net were Hospital return days for pneumonia patients, Medicare spending per patient, Death rate for pneumonia patients, State, and Venous Thromboembolism Prophylaxis. The most important features from the from the OLS Regression (looking at the absolute values) were Hospital return days for pneumonia patients: |–0.396|, Medicare spending per patient: |0.167| Death rate for pneumonia patients: |–0.154|, Venous Thromboembolism Prophylaxis: |0.101|, State: |0.110|. We can see that the most important features in Elastic Net match exactly with the top 5 in OLS regression, based on the absolute value of the coefficients. This shows that there is a strong consistency between the two models in terms of finding the influential predictors.
How about about of PCA? You may have already noticed that importance almost seems inverted! Why?!
Hint 1: What is the fundamental difference between a PCA and any regression?
Hint 2: What does “importance” mean in a PCA vs. any regression?
Hint 3: What does the sign of the loadings mean in a PCA? Are they actual or arbitrary?
PCA and regression measure importance differently because they have different overall goals. Regression will find features that best predict the outcome (Y), so importance is based on the size of the effect on Y. PCA finds features that explain the most variance in the predictors (X), while ignoring the outcome, so importance is based on how much a feature varies across samples. Also, the signs of PCA loadings are arbitrary and don’t indicate direction like regression coefficients do. Because of this, a feature can be very important for predicting the outcome (high regression coefficient) but not vary much (low PCA loading), or the other way around, overall making importance seem inverted between the two methods (Jolliffe, 2002).
Assessing model performance means assessing how well
our training model does when applied to “unseen” data, here our
readyTest dataset. Because we know the right answer - the
true value of PredictedReadmissionRate, we can assess the
error that the training model produces when fitted against the testing
data. We measure that error as \(RMSE\)
and \(MAE\). We briefly reviewed what
the \(RMSE\) measures back in Question 36, so I will only review what the \(MAE\) is here.
Mean Absolute Error (\(MAE\)) is the average of the absolute differences between predicted and actual values, such that \(MAE = \frac{1}{n}\sum_{i=1}^n\cdot|y_i-\hat{y}_i|\) where \(y_i\) is each observed value, \(\hat{y}_i\) is each corresponding predicted value, and \(n\) is the number of observations. So, unlike \(RMSE\), \(MAE\) treats all errors equally and therefore is more robust to outliers. So, \(MAE\) will tell you the average magnitude of the error, regardless of direction!
Let’s also make predictions on the testing sets for both the OLS and
Elastic Net models using the postResample() function in
caret; this will not only do predictions, but it will
enable us to more easily generate metrics that we can use to compare
model performance, such as \(R^2\) and
MAE.
## Post-resample on OLS
prOLS <- postResample(pred = predict(mod, newdata = readyTest),
obs = readyTest$bc_PredictedReadmissionRate)
## Post-resample on Elastic Net
prElastic <- postResample(pred = predict(elasticMod, newdata = readyTest),
obs = readyTest$bc_PredictedReadmissionRate)
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| prOLS | 0.81 | 0.35 | 0.64 |
| prElastic | 0.81 | 0.35 | 0.64 |
Using the same method, check for overfitting / underfitting. Make
predictions using readyTrain rather than
readyTest, and then check how the RMSE on the training data
compares to the RMSE of the training data for the Elastic Net. How do
they compare? (Recall what we discussed when we did the gene expression
demo: the relationship of the test vs. training accuracy can be used to
assess overfitting/underfitting. The same logic applies to RMSE too!)
How do both compare to what we got out of the OLS
regression? Which would you say is the better predictive model,
the OLS regression or the Elastic Net? Why?
It looks like the training and test RMSE values for both models are almost identical: RMSE = 0.78 on training and RMSE = 0.81 on testing. Since there is such a small difference, it suggests that neither model is overfitting or underfitting. Since both have almost similar performances on both the traing and test sets, I would say that Elastic Net is still preferred. It has built in regularization which can prevent overfitting when the dataset has many features or has multicollinearity. It also produces a more stable and interpretable models where there is more complexity.
## Post-resample on OLS
prOLS <- postResample(pred = predict(mod, newdata = readyTrain),
obs = readyTest$bc_PredictedReadmissionRate)
## Post-resample on Elastic Net
prElastic <- postResample(pred = predict(elasticMod, newdata = readyTrain),
obs = readyTest$bc_PredictedReadmissionRate)
## Post-resample on OLS (fixed)
prOLS <- postResample(pred = predict(mod, newdata = readyTrain),
obs = readyTrain$bc_PredictedReadmissionRate)
## Post-resample on Elastic Net (fixed)
prElastic <- postResample(
pred = predict(elasticMod, newdata = readyTrain),
obs = readyTrain$bc_PredictedReadmissionRate
)
| RMSE | Rsquared | MAE | |
|---|---|---|---|
| prOLS | 0.78 | 0.4 | 0.61 |
| prElastic | 0.78 | 0.4 | 0.61 |
Figure 19. Predictions Plot for Elastic Net vs. OLS Regression
It should not come as a surprise, but these models are performing VERY similarly!
What conclusions do you make at this point about whether to make predictions with an OLS regression, Elastic Net, or neither? What recommendations would you make to our client at this point? What other analyses might your suggest, or other data we might want to include? Please try to flesh out 2-3 recommendations in at least a paragraph.
HINT: Make sure to ask yourself whether these data appear linear or not. Could non-linearity be a cause for a low \(R^2\)? If not, what other causes are there and how will that impact your recommendations?
We can see that both the OLS and Elastic Net models are performing very similarly. They both show consistent results between the training and testing data, which is a good sign. It means they are not overfitting or underfitting. However, they are only explaining about 35–40% of the variation in readmission rates. That means that while the models are stable, they’re not strong predictors. Between the two models, Elastic Net may still be the better option because it can handle data that is more complex and also reduce noise from the variables that are less useful. Overall, neither model is giving us the level of accuracy we would want to have to have confident predictions. There might be a few posibilities for that. One possibility is that the relationship between the predictors and readmission rates isn’t actually linear, and both OLS and Elastic Net assume a linear relationship. As we saw earlier with the OLS Regression Assumption Diagnostic Plots, linearity was mostly met, with some minor nonlinear patterns. There was a slight upward curve in the Residuals vs. Fitted graph. To help with this, using more flexible models like random forests, boosted trees, or other nonlinear models might help to see patterns that are missing. Also, looking further at the data to see if ther might have been important factors that are not be captured in the current model. Expanding the dataset with additional variables that could impact readmission rates could lead to stronger, more accurate predictions.
It’s time to choose your adventure! After working through this demo and answering the 31 questions, you will choose one of three possible trajectories. The following are some brief hints or tips for each of the choices available to you.
[Continue analyzing the pneuomnia data.]
Make sure to justify your choice.
Choose at least two new algorithms to perform. They can be unsupervised or supervised, but must be 2+.
A natural extension would be to do a Ridge and/or a LASSO; the former would tell you if you should be controlling for multicollinearity more than you currently are, the latter would tell you if you need to further shrink coefficients than you currently are. Additionally, you could consider a regression-based method that incorporates non-linearity, such as a random forest. However, you are free to choose anything you think is appropriate here!
Other choices I think you could potentially justify to stakeholders, if you were interested:
load(file = “FY2024_data_files/readyTest.Rdata”)`
kNN using the 2 clusters we’ve identified and named in our segmentation analysis
Our data are also ready for a neural network analysis (e.g., a feed-forward NN)!
shared_cols<-intersect(colnames(readyTrain),colnames(readyTest))
readyTrain<-readyTrain[,shared_cols]
readyTest<-readyTest[,shared_cols]
We began our analysis with a Ridge regression model due to its effectiveness in handling multicollinearity. Even after preprocessing and feature reduction, our dataset remains fairly large and may still contain correlated predictors. Ridge regression shrinks coefficient estimates, which helps reduce variance, prevent overfitting, and improve model generalization to new data. This approach allows us to retain all predictors while stabilizing the estimates produced from the data set.
x.train <- as.matrix(readyTrain[,-18])
y.train <- as.matrix(readyTrain[,18])
y.test <- as.matrix(readyTest[,18])
x.test<- as.matrix(readyTest[,-18])
#running the ridge model
set.seed(42)
ridge.fit <- glmnet(x.train,y.train,alpha=0)
cv.ridge.fit <- cv.glmnet(x.train,y.train,alpha=0)
#using cross-val to choose the best lambda
bestlam <- cv.ridge.fit$lambda.min
bestlam
## [1] 0.05010813
#calculating the test error
ridge.pred <- predict(ridge.fit, s=bestlam, newx=x.test)
ridge.error <- mean((ridge.pred-y.test)^2)
ridge.error
## [1] 0.6505919
Given that the Ridge regression produced a test error of 0.65, we also applied a LASSO regression model to evaluate whether we could improve performance and model interpretability. Like the ridge regression model, the LASSO model also manages large coefficients, but it also performs feature selection by shrinking some coefficients exactly to zero. This is useful for our data set, as it highlights the most predictive variables for pneumonia-related readmissions while discarding irrelevant or redundant ones.
x.train <- as.matrix(readyTrain[,-18])
y.train <- as.matrix(readyTrain[,18])
y.test <- as.matrix(readyTest[,18])
x.test<- as.matrix(readyTest[,-18])
#running the lasso model
set.seed(42)
lasso.fit <- glmnet(x.train,y.train,alpha=1)
cv.lasso.fit <- cv.glmnet(x.train,y.train,alpha=1)
#using cross-val to choose the best lambda
bestlam <- cv.lasso.fit$lambda.min
bestlam
## [1] 0.01006789
#calculating the test error
lasso.pred <- predict(lasso.fit, s=bestlam, newx=x.test)
lasso.error <- mean((lasso.pred-y.test)^2)
lasso.error
## [1] 0.6522352
We chose to use the Random Forest model to analyze our data for several important reasons. This model is well-suited for large and complex datasets, like our pneumonia dataset, and is great at handling non-linear relationships that may exist between variables in the data set. Random Forest builds multiple decision trees and aggregates their results, which increases predictive accuracy and reduces overfitting, making the model more robust and reliable. Additionally, it provides an analysis of feature performance by ranking variable importance, offering valuable insights into which factors most influence pneumonia-related hospital readmissions. This combination of flexibility, accuracy, and interpretability makes Random Forest a strong choice for our analysis.
We chose to use Hospital.return.days.for.pneumonia.patients as our Target variable. We first need to make it into a Factor.
# Finding the unique values to make the target variable a factor
unique(readyTrain$Hospital.return.days.for.pneumonia.patients)
## [1] -1.397867 0.318403 2.034673
unique(readyTest$Hospital.return.days.for.pneumonia.patients)
## [1] -1.4221282 0.3209078 2.0639438
# Defining the breakpoints and labels based of the unique values
breaks <- c(-Inf, -1.34, 0.33, 2.1)
labels <- c("Low", "Medium", "High")
# Creating factor variable for Train data
readyTrain$ReturnDaysCategory <- cut(
readyTrain$Hospital.return.days.for.pneumonia.patients,
breaks = breaks,
labels = labels,
right = TRUE, # include upper bound in interval
include.lowest = TRUE
)
# Creating factor variable for Test data
readyTest$ReturnDaysCategory <- cut(
readyTest$Hospital.return.days.for.pneumonia.patients,
breaks = breaks,
labels = labels,
right = TRUE, # include upper bound in interval
include.lowest = TRUE
)
# Confirming the transformations worked
table(readyTrain$ReturnDaysCategory)
##
## Low Medium High
## 618 1384 208
str(readyTrain$ReturnDaysCategory)
## Factor w/ 3 levels "Low","Medium",..: 1 2 2 2 2 2 2 2 2 2 ...
table(readyTest$ReturnDaysCategory)
##
## Low Medium High
## 141 329 46
str(readyTest$ReturnDaysCategory)
## Factor w/ 3 levels "Low","Medium",..: 1 2 2 2 2 1 2 1 2 2 ...
It looks like the classes are unbalanced. We are going to balance the classes using class weights that we chose based off of our own logic.
# Defining weights based on own logic
class_weights <- c(Low = 2, Medium = 1, High = 5)
# Original data
table(readyTrain$ReturnDaysCategory)
##
## Low Medium High
## 618 1384 208
# Creating a vector of weights for each row according to its class
row_weights <- class_weights[as.character(readyTrain$ReturnDaysCategory)]
# Replicating rows by weights
weighted_train <- readyTrain[rep(seq_len(nrow(readyTrain)), times = row_weights), ]
# New class distribution after replication
table(weighted_train$ReturnDaysCategory)
##
## Low Medium High
## 1236 1384 1040
Now that the classes have a better balance, we will run the Random Forest Model.
# Loading package
library(randomForest)
# Making sure that target is a factor
weighted_train$ReturnDaysCategory <- as.factor(weighted_train$ReturnDaysCategory)
# Defining the target
target_variable <- "ReturnDaysCategory"
# Excluding the numeric version of the return days from the predictors
predictors <- setdiff(names(weighted_train), c("ReturnDaysCategory", "Hospital.return.days.for.pneumonia.patients"))
# Building the formula
rf_formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = " + ")))
# Fitting the random forest classification model
set.seed(42)
rf_model_classification <- randomForest(
rf_formula,
data = weighted_train,
importance = TRUE,
ntree = 500,
)
# Viewing the model summary
print(rf_model_classification)
##
## Call:
## randomForest(formula = rf_formula, data = weighted_train, importance = TRUE, ntree = 500, )
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 6.78%
## Confusion matrix:
## Low Medium High class.error
## Low 1236 0 0 0.0000000
## Medium 233 1136 15 0.1791908
## High 0 0 1040 0.0000000
# Predict on test set
test_predictors <- predict(rf_model_classification, newdata = readyTest)
# Evaluate performance
confusion_matrix <- confusionMatrix(test_predictors, readyTest$ReturnDaysCategory)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 83 67 2
## Medium 57 256 44
## High 1 6 0
##
## Overall Statistics
##
## Accuracy : 0.657
## 95% CI : (0.6142, 0.6979)
## No Information Rate : 0.6376
## P-Value [Acc > NIR] : 0.1924
##
## Kappa : 0.2811
##
## Mcnemar's Test P-Value : 1.367e-06
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.5887 0.7781 0.00000
## Specificity 0.8160 0.4599 0.98511
## Pos Pred Value 0.5461 0.7171 0.00000
## Neg Pred Value 0.8407 0.5409 0.90963
## Prevalence 0.2733 0.6376 0.08915
## Detection Rate 0.1609 0.4961 0.00000
## Detection Prevalence 0.2946 0.6919 0.01357
## Balanced Accuracy 0.7023 0.6190 0.49255
test_accuracy <- mean(test_predictors == readyTest$ReturnDaysCategory)
cat("Accuracy on Test Set:", round(test_accuracy * 100, 2), "%\n")
## Accuracy on Test Set: 65.7 %
We can see that on the training data, the model misclassifies about 6.78% of cases, which is really good. However, on on the test data, the accuracy dropped to 65.7%. This is a sign that the model is overfitting. This might be due to the class weighing. This means that we need to take further steps to see what is going on. Mayve some further tuning can get a higher accuracy.
Cross-validation may prevent overfitting by testing the model on unseen folds during the training.
Tuning mtry may help find the best trade off between bias and variance.
Setting automated tuning may help to prevent guessing which parameters to use.
library(caret)
# making sure the numeric target is dropped from train and test sets
weighted_train_clean <- weighted_train[, !names(weighted_train) %in% "Hospital.return.days.for.pneumonia.patients"]
readyTest_clean <- readyTest[, !names(readyTest) %in% "Hospital.return.days.for.pneumonia.patients"]
set.seed(42)
# Fitting the newly tuned Random Forest Model using train() function from caret
tuned_model <- train(
ReturnDaysCategory ~ .,
data = weighted_train_clean,
method = "rf",
trControl = trainControl(
method = "cv", # cross validation to evaluate model performance
number = 5), # set 5 number of folds in the cross validation
tuneLength = 5 # try 5 different values for the 'mtry' hyperparameter
)
# Viewing the model summary
print(tuned_model)
## Random Forest
##
## 3660 samples
## 17 predictor
## 3 classes: 'Low', 'Medium', 'High'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 2928, 2928, 2927, 2929, 2928
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8907084 0.8353990
## 5 0.8874324 0.8306754
## 9 0.8800572 0.8197254
## 13 0.8781424 0.8168965
## 17 0.8732266 0.8096063
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Predict on test set
test_preds <- predict(tuned_model, newdata = readyTest_clean)
# Evaluate performance
conf_matrix <- confusionMatrix(test_preds, readyTest$ReturnDaysCategory)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Medium High
## Low 78 53 1
## Medium 63 273 45
## High 0 3 0
##
## Overall Statistics
##
## Accuracy : 0.6802
## 95% CI : (0.6381, 0.7203)
## No Information Rate : 0.6376
## P-Value [Acc > NIR] : 0.02372
##
## Kappa : 0.303
##
## Mcnemar's Test P-Value : 2.097e-08
##
## Statistics by Class:
##
## Class: Low Class: Medium Class: High
## Sensitivity 0.5532 0.8298 0.000000
## Specificity 0.8560 0.4225 0.993617
## Pos Pred Value 0.5909 0.7165 0.000000
## Neg Pred Value 0.8359 0.5852 0.910331
## Prevalence 0.2733 0.6376 0.089147
## Detection Rate 0.1512 0.5291 0.000000
## Detection Prevalence 0.2558 0.7384 0.005814
## Balanced Accuracy 0.7046 0.6261 0.496809
test_accuracy <- mean(test_preds == readyTest$ReturnDaysCategory)
cat("Accuracy on Test Set:", round(test_accuracy * 100, 2), "%\n")
## Accuracy on Test Set: 68.02 %
It looks like the accuracy increased to 68.02%. This better than before but we can see that the model is still struggling with identifying High values.
We made a Random Forest model where we classified our data into three groups, Low, Medium, and High. We used 17 predictor variables and a total of 3,660 samples. Before doing any training, we applied class weighing where we manually inputted weights based off of our own knowledge of the imbalance data.
To find the best model settings, we used 5 fold cross validation and tested different values for the parameter mtry. We tried values of 2, 5, 9, 13, and 17. In our results, we found that the model performed best when mtry was set to 2, achieving about 89% accuracy during cross validation. As mtry increased, the performance dropped slightly, meaning the model worked better when it considered fewer variables at each decision point.
When we tested model on the test data, the overall accuracy was around 68%. We can see that the model did a good job identifying the Medium class, but had a hard time identifying the High class, missing all the High cases and confusing them with the Medium group. The Low class was identified reasonably well but sometimes got mixed up with Medium too since it contained the most data. In our case, we are most interested in the Low class and Medium classes. We are wanting to provide the watchdogs with information on the number of days until a patient who was previously hospitalized for pneumonia is readmitted to the hospital. A lower value means that the patient was readmitted to the hospital soon after the discharge, in fewer days, suggesting that there was a quicker relapse and some sort of complication or unresolved issue following the pneumonia treatment. This probably means that there was inadequate recovery or some complications that occurred at that hospital.
With that being said, the model works pretty well, especially for what we are looking for with the Medium and low class. It still has trouble picking out the High category accurately which could be due to the High class being less common. If we wanted to improve on this models prefomance even more in the future, we could try some balancing techniques, adding new features, or exploring other modeling approaches.
I am looking for 1-2 markdown documents with their knitted HTML. For example, if you’re choosing Adventures 1 or 2, you may want to work through this document once, make a copy of this markdown document, move the source code to the TOP, edit the source code as needed, and then go through this again from the beginning using the new condition(s) you chose.
Alternatively, if you’re choosing Adventure 3, maybe you just choose to add on to the bottom of this document, replacing the ‘Choose Your Own Adventure’ section with your actual adventure.
Just make sure to answer the questions well,and make sure to justify the decisions you make. Tell me WHY you’re choosing the condition(s) you are in Adventures 1 or 2, or Tell me WHY you’re doing the analyses you are in Adventure 3. Other than that, make this your own learning adventure!
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.
Jolliffe, Ian T. Principal Component Analysis. Springer Series in Statistics, 2002.